Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeMuscles in Action
Human motion is created by, and constrained by, our muscles. We take a first step at building computer vision methods that represent the internal muscle activity that causes motion. We present a new dataset, Muscles in Action (MIA), to learn to incorporate muscle activity into human motion representations. The dataset consists of 12.5 hours of synchronized video and surface electromyography (sEMG) data of 10 subjects performing various exercises. Using this dataset, we learn a bidirectional representation that predicts muscle activation from video, and conversely, reconstructs motion from muscle activation. We evaluate our model on in-distribution subjects and exercises, as well as on out-of-distribution subjects and exercises. We demonstrate how advances in modeling both modalities jointly can serve as conditioning for muscularly consistent motion generation. Putting muscles into computer vision systems will enable richer models of virtual humans, with applications in sports, fitness, and AR/VR.
emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography
Surface electromyography (sEMG) non-invasively measures signals generated by muscle activity with sufficient sensitivity to detect individual spinal neurons and richness to identify dozens of gestures and their nuances. Wearable wrist-based sEMG sensors have the potential to offer low friction, subtle, information rich, always available human-computer inputs. To this end, we introduce emg2qwerty, a large-scale dataset of non-invasive electromyographic signals recorded at the wrists while touch typing on a QWERTY keyboard, together with ground-truth annotations and reproducible baselines. With 1,135 sessions spanning 108 users and 346 hours of recording, this is the largest such public dataset to date. These data demonstrate non-trivial, but well defined hierarchical relationships both in terms of the generative process, from neurons to muscles and muscle combinations, as well as in terms of domain shift across users and user sessions. Applying standard modeling techniques from the closely related field of Automatic Speech Recognition (ASR), we show strong baseline performance on predicting key-presses using sEMG signals alone. We believe the richness of this task and dataset will facilitate progress in several problems of interest to both the machine learning and neuroscientific communities. Dataset and code can be accessed at https://github.com/facebookresearch/emg2qwerty.
Pattern Discovery in Time Series with Byte Pair Encoding
The growing popularity of wearable sensors has generated large quantities of temporal physiological and activity data. Ability to analyze this data offers new opportunities for real-time health monitoring and forecasting. However, temporal physiological data presents many analytic challenges: the data is noisy, contains many missing values, and each series has a different length. Most methods proposed for time series analysis and classification do not handle datasets with these characteristics nor do they offer interpretability and explainability, a critical requirement in the health domain. We propose an unsupervised method for learning representations of time series based on common patterns identified within them. The patterns are, interpretable, variable in length, and extracted using Byte Pair Encoding compression technique. In this way the method can capture both long-term and short-term dependencies present in the data. We show that this method applies to both univariate and multivariate time series and beats state-of-the-art approaches on a real world dataset collected from wearable sensors.
Real-Time Fitness Exercise Classification and Counting from Video Frames
This paper introduces a novel method for real-time exercise classification using a Bidirectional Long Short-Term Memory (BiLSTM) neural network. Existing exercise recognition approaches often rely on synthetic datasets, raw coordinate inputs sensitive to user and camera variations, and fail to fully exploit the temporal dependencies in exercise movements. These issues limit their generalizability and robustness in real-world conditions, where lighting, camera angles, and user body types vary. To address these challenges, we propose a BiLSTM-based model that leverages invariant features, such as joint angles, alongside raw coordinates. By using both angles and (x, y, z) coordinates, the model adapts to changes in perspective, user positioning, and body differences, improving generalization. Training on 30-frame sequences enables the BiLSTM to capture the temporal context of exercises and recognize patterns evolving over time. We compiled a dataset combining synthetic data from the InfiniteRep dataset and real-world videos from Kaggle and other sources. This dataset includes four common exercises: squat, push-up, shoulder press, and bicep curl. The model was trained and validated on these diverse datasets, achieving an accuracy of over 99% on the test set. To assess generalizability, the model was tested on 2 separate test sets representative of typical usage conditions. Comparisons with the previous approach from the literature are present in the result section showing that the proposed model is the best-performing one. The classifier is integrated into a web application providing real-time exercise classification and repetition counting without manual exercise selection. Demo and datasets are available at the following GitHub Repository: https://github.com/RiccardoRiccio/Fitness-AI-Trainer-With-Automatic-Exercise-Recognition-and-Counting.
FLEX: A Large-Scale Multi-Modal Multi-Action Dataset for Fitness Action Quality Assessment
With the increasing awareness of health and the growing desire for aesthetic physique, fitness has become a prevailing trend. However, the potential risks associated with fitness training, especially with weight-loaded fitness actions, cannot be overlooked. Action Quality Assessment (AQA), a technology that quantifies the quality of human action and provides feedback, holds the potential to assist fitness enthusiasts of varying skill levels in achieving better training outcomes. Nevertheless, current AQA methodologies and datasets are limited to single-view competitive sports scenarios and RGB modality and lack professional assessment and guidance of fitness actions. To address this gap, we propose the FLEX dataset, the first multi-modal, multi-action, large-scale dataset that incorporates surface electromyography (sEMG) signals into AQA. FLEX utilizes high-precision MoCap to collect 20 different weight-loaded actions performed by 38 subjects across 3 different skill levels for 10 repetitions each, containing 5 different views of the RGB video, 3D pose, sEMG, and physiological information. Additionally, FLEX incorporates knowledge graphs into AQA, constructing annotation rules in the form of penalty functions that map weight-loaded actions, action keysteps, error types, and feedback. We conducted various baseline methodologies on FLEX, demonstrating that multimodal data, multiview data, and fine-grained annotations significantly enhance model performance. FLEX not only advances AQA methodologies and datasets towards multi-modal and multi-action scenarios but also fosters the integration of artificial intelligence within the fitness domain. Dataset and code are available at https://haoyin116.github.io/FLEX_Dataset.
Volitional Control of the Paretic Hand Post-Stroke Increases Finger Stiffness and Resistance to Robot-Assisted Movement
Increased effort during use of the paretic arm and hand can provoke involuntary abnormal synergy patterns and amplify stiffness effects of muscle tone for individuals after stroke, which can add difficulty for user-controlled devices to assist hand movement during functional tasks. We study how volitional effort, exerted in an attempt to open or close the hand, affects resistance to robot-assisted movement at the finger level. We perform experiments with three chronic stroke survivors to measure changes in stiffness when the user is actively exerting effort to activate ipsilateral EMG-controlled robot-assisted hand movements, compared with when the fingers are passively stretched, as well as overall effects from sustained active engagement and use. Our results suggest that active engagement of the upper extremity increases muscle tone in the finger to a much greater degree than through passive-stretch or sustained exertion over time. Potential design implications of this work suggest that developers should anticipate higher levels of finger stiffness when relying on user-driven ipsilateral control methods for assistive or rehabilitative devices for stroke.
BioMoDiffuse: Physics-Guided Biomechanical Diffusion for Controllable and Authentic Human Motion Synthesis
Human motion generation holds significant promise in fields such as animation, film production, and robotics. However, existing methods often fail to produce physically plausible movements that adhere to biomechanical principles. While recent autoregressive and diffusion models have improved visual quality, they frequently overlook essential biodynamic features, such as muscle activation patterns and joint coordination, leading to motions that either violate physical laws or lack controllability. This paper introduces BioMoDiffuse, a novel biomechanics-aware diffusion framework that addresses these limitations. It features three key innovations: (1) A lightweight biodynamic network that integrates muscle electromyography (EMG) signals and kinematic features with acceleration constraints, (2) A physics-guided diffusion process that incorporates real-time biomechanical verification via modified Euler-Lagrange equations, and (3) A decoupled control mechanism that allows independent regulation of motion speed and semantic context. We also propose a set of comprehensive evaluation protocols that combines traditional metrics (FID, R-precision, etc.) with new biomechanical criteria (smoothness, foot sliding, floating, etc.). Our approach bridges the gap between data-driven motion synthesis and biomechanical authenticity, establishing new benchmarks for physically accurate motion generation.
Data-Driven Goal Recognition in Transhumeral Prostheses Using Process Mining Techniques
A transhumeral prosthesis restores missing anatomical segments below the shoulder, including the hand. Active prostheses utilize real-valued, continuous sensor data to recognize patient target poses, or goals, and proactively move the artificial limb. Previous studies have examined how well the data collected in stationary poses, without considering the time steps, can help discriminate the goals. In this case study paper, we focus on using time series data from surface electromyography electrodes and kinematic sensors to sequentially recognize patients' goals. Our approach involves transforming the data into discrete events and training an existing process mining-based goal recognition system. Results from data collected in a virtual reality setting with ten subjects demonstrate the effectiveness of our proposed goal recognition approach, which achieves significantly better precision and recall than the state-of-the-art machine learning techniques and is less confident when wrong, which is beneficial when approximating smoother movements of prostheses.
UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language
We introduce UbiPhysio, a milestone framework that delivers fine-grained action description and feedback in natural language to support people's daily functioning, fitness, and rehabilitation activities. This expert-like capability assists users in properly executing actions and maintaining engagement in remote fitness and rehabilitation programs. Specifically, the proposed UbiPhysio framework comprises a fine-grained action descriptor and a knowledge retrieval-enhanced feedback module. The action descriptor translates action data, represented by a set of biomechanical movement features we designed based on clinical priors, into textual descriptions of action types and potential movement patterns. Building on physiotherapeutic domain knowledge, the feedback module provides clear and engaging expert feedback. We evaluated UbiPhysio's performance through extensive experiments with data from 104 diverse participants, collected in a home-like setting during 25 types of everyday activities and exercises. We assessed the quality of the language output under different tuning strategies using standard benchmarks. We conducted a user study to gather insights from clinical physiotherapists and potential users about our framework. Our initial tests show promise for deploying UbiPhysio in real-life settings without specialized devices.
Upper Limb Movement Recognition utilising EEG and EMG Signals for Rehabilitative Robotics
Upper limb movement classification, which maps input signals to the target activities, is a key building block in the control of rehabilitative robotics. Classifiers are trained for the rehabilitative system to comprehend the desires of the patient whose upper limbs do not function properly. Electromyography (EMG) signals and Electroencephalography (EEG) signals are used widely for upper limb movement classification. By analysing the classification results of the real-time EEG and EMG signals, the system can understand the intention of the user and predict the events that one would like to carry out. Accordingly, it will provide external help to the user. However, the noise in the real-time EEG and EMG data collection process contaminates the effectiveness of the data, which undermines classification performance. Moreover, not all patients process strong EMG signals due to muscle damage and neuromuscular disorder. To address these issues, this paper explores different feature extraction techniques and machine learning and deep learning models for EEG and EMG signals classification and proposes a novel decision-level multisensor fusion technique to integrate EEG signals with EMG signals. This system retrieves effective information from both sources to understand and predict the desire of the user, and thus aid. By testing out the proposed technique on a publicly available WAY-EEG-GAL dataset, which contains EEG and EMG signals that were recorded simultaneously, we manage to conclude the feasibility and effectiveness of the novel system.
Pain level and pain-related behaviour classification using GRU-based sparsely-connected RNNs
There is a growing body of studies on applying deep learning to biometrics analysis. Certain circumstances, however, could impair the objective measures and accuracy of the proposed biometric data analysis methods. For instance, people with chronic pain (CP) unconsciously adapt specific body movements to protect themselves from injury or additional pain. Because there is no dedicated benchmark database to analyse this correlation, we considered one of the specific circumstances that potentially influence a person's biometrics during daily activities in this study and classified pain level and pain-related behaviour in the EmoPain database. To achieve this, we proposed a sparsely-connected recurrent neural networks (s-RNNs) ensemble with the gated recurrent unit (GRU) that incorporates multiple autoencoders using a shared training framework. This architecture is fed by multidimensional data collected from inertial measurement unit (IMU) and surface electromyography (sEMG) sensors. Furthermore, to compensate for variations in the temporal dimension that may not be perfectly represented in the latent space of s-RNNs, we fused hand-crafted features derived from information-theoretic approaches with represented features in the shared hidden state. We conducted several experiments which indicate that the proposed method outperforms the state-of-the-art approaches in classifying both pain level and pain-related behaviour.
HTNet for micro-expression recognition
Facial expression is related to facial muscle contractions and different muscle movements correspond to different emotional states. For micro-expression recognition, the muscle movements are usually subtle, which has a negative impact on the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks. This can result in sub-optimal performance on micro-expression recognition tasks.Therefore, learning to recognize facial muscle movements is a key challenge in the area of micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages the local temporal features and an aggregation layer that extracts local and global semantical facial features. Specifically, HTNet divides the face into four different facial areas: left lip area, left eye area, right eye area and right lip area. The transformer layer is used to focus on representing local minor muscle movement with local self-attention in each area. The aggregation layer is used to learn the interactions between eye areas and lip areas. The experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin. The codes and models are available at: https://github.com/wangzhifengharrison/HTNet
PhysiQ: Off-site Quality Assessment of Exercise in Physical Therapy
Physical therapy (PT) is crucial for patients to restore and maintain mobility, function, and well-being. Many on-site activities and body exercises are performed under the supervision of therapists or clinicians. However, the postures of some exercises at home cannot be performed accurately due to the lack of supervision, quality assessment, and self-correction. Therefore, in this paper, we design a new framework, PhysiQ, that continuously tracks and quantitatively measures people's off-site exercise activity through passive sensory detection. In the framework, we create a novel multi-task spatio-temporal Siamese Neural Network that measures the absolute quality through classification and relative quality based on an individual's PT progress through similarity comparison. PhysiQ digitizes and evaluates exercises in three different metrics: range of motions, stability, and repetition.
Fatigue-PINN: Physics-Informed Fatigue-Driven Motion Modulation and Synthesis
Fatigue modeling is essential for motion synthesis tasks to model human motions under fatigued conditions and biomechanical engineering applications, such as investigating the variations in movement patterns and posture due to fatigue, defining injury risk mitigation and prevention strategies, formulating fatigue minimization schemes and creating improved ergonomic designs. Nevertheless, employing data-driven methods for synthesizing the impact of fatigue on motion, receives little to no attention in the literature. In this work, we present Fatigue-PINN, a deep learning framework based on Physics-Informed Neural Networks, for modeling fatigued human movements, while providing joint-specific fatigue configurations for adaptation and mitigation of motion artifacts on a joint level, resulting in more realistic animations. To account for muscle fatigue, we simulate the fatigue-induced fluctuations in the maximum exerted joint torques by leveraging a PINN adaptation of the Three-Compartment Controller model to exploit physics-domain knowledge for improving accuracy. This model also introduces parametric motion alignment with respect to joint-specific fatigue, hence avoiding sharp frame transitions. Our results indicate that Fatigue-PINN accurately simulates the effects of externally perceived fatigue on open-type human movements being consistent with findings from real-world experimental fatigue studies. Since fatigue is incorporated in torque space, Fatigue-PINN provides an end-to-end encoder-decoder-like architecture, to ensure transforming joint angles to joint torques and vice-versa, thus, being compatible with motion synthesis frameworks operating on joint angles.
High-density Electromyography for Effective Gesture-based Control of Physically Assistive Mobile Manipulators
Injury to the cervical spinal cord can cause quadriplegia, impairing muscle function in all four limbs. People with impaired hand function and mobility encounter significant difficulties in carrying out essential self-care and household tasks. Despite the impairment of their neural drive, their volitional myoelectric activity is often partially preserved. High-density electromyography (HDEMG) can detect this myoelectric activity, which can serve as control inputs to assistive devices. Previous HDEMG-controlled robotic interfaces have primarily been limited to controlling table-mounted robot arms. These have constrained reach capabilities. Instead, the ability to control mobile manipulators, which have no such workspace constraints, could allow individuals with quadriplegia to perform a greater variety of assistive tasks, thus restoring independence and reducing caregiver workload. In this study, we introduce a non-invasive wearable HDEMG interface with real-time myoelectric hand gesture recognition, enabling both coarse and fine control over the intricate mobility and manipulation functionalities of an 8 degree-of-freedom mobile manipulator. Our evaluation, involving 13 participants engaging in challenging self-care and household activities, demonstrates the potential of our wearable HDEMG system to profoundly enhance user independence by enabling non-invasive control of a mobile manipulator.
Decoding Human Activities: Analyzing Wearable Accelerometer and Gyroscope Data for Activity Recognition
A person's movement or relative positioning effectively generates raw electrical signals that can be read by computing machines to apply various manipulative techniques for the classification of different human activities. In this paper, a stratified multi-structural approach based on a Residual network ensembled with Residual MobileNet is proposed, termed as FusionActNet. The proposed method involves using carefully designed Residual blocks for classifying the static and dynamic activities separately because they have clear and distinct characteristics that set them apart. These networks are trained independently, resulting in two specialized and highly accurate models. These models excel at recognizing activities within a specific superclass by taking advantage of the unique algorithmic benefits of architectural adjustments. Afterward, these two ResNets are passed through a weighted ensemble-based Residual MobileNet. Subsequently, this ensemble proficiently discriminates between a specific static and a specific dynamic activity, which were previously identified based on their distinct feature characteristics in the earlier stage. The proposed model is evaluated using two publicly accessible datasets; namely, UCI HAR and Motion-Sense. Therein, it successfully handled the highly confusing cases of data overlap. Therefore, the proposed approach achieves a state-of-the-art accuracy of 96.71% and 95.35% in the UCI HAR and Motion-Sense datasets respectively.
MyoDex: A Generalizable Prior for Dexterous Manipulation
Human dexterity is a hallmark of motor control. Our hands can rapidly synthesize new behaviors despite the complexity (multi-articular and multi-joints, with 23 joints controlled by more than 40 muscles) of musculoskeletal sensory-motor circuits. In this work, we take inspiration from how human dexterity builds on a diversity of prior experiences, instead of being acquired through a single task. Motivated by this observation, we set out to develop agents that can build upon their previous experience to quickly acquire new (previously unattainable) behaviors. Specifically, our approach leverages multi-task learning to implicitly capture task-agnostic behavioral priors (MyoDex) for human-like dexterity, using a physiologically realistic human hand model - MyoHand. We demonstrate MyoDex's effectiveness in few-shot generalization as well as positive transfer to a large repertoire of unseen dexterous manipulation tasks. Agents leveraging MyoDex can solve approximately 3x more tasks, and 4x faster in comparison to a distillation baseline. While prior work has synthesized single musculoskeletal control behaviors, MyoDex is the first generalizable manipulation prior that catalyzes the learning of dexterous physiological control across a large variety of contact-rich behaviors. We also demonstrate the effectiveness of our paradigms beyond musculoskeletal control towards the acquisition of dexterity in 24 DoF Adroit Hand. Website: https://sites.google.com/view/myodex
SayTap: Language to Quadrupedal Locomotion
Large language models (LLMs) have demonstrated the potential to perform high-level planning. Yet, it remains a challenge for LLMs to comprehend low-level commands, such as joint angle targets or motor torques. This paper proposes an approach to use foot contact patterns as an interface that bridges human commands in natural language and a locomotion controller that outputs these low-level commands. This results in an interactive system for quadrupedal robots that allows the users to craft diverse locomotion behaviors flexibly. We contribute an LLM prompt design, a reward function, and a method to expose the controller to the feasible distribution of contact patterns. The results are a controller capable of achieving diverse locomotion patterns that can be transferred to real robot hardware. Compared with other design choices, the proposed approach enjoys more than 50% success rate in predicting the correct contact patterns and can solve 10 more tasks out of a total of 30 tasks. Our project site is: https://saytap.github.io.
SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation
Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action <action name>?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). In our experiments, that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/.
A Medical Low-Back Pain Physical Rehabilitation Dataset for Human Body Movement Analysis
While automatic monitoring and coaching of exercises are showing encouraging results in non-medical applications, they still have limitations such as errors and limited use contexts. To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we identify in this article four challenges to address and propose a medical dataset of clinical patients carrying out low back-pain rehabilitation exercises. The dataset includes 3D Kinect skeleton positions and orientations, RGB videos, 2D skeleton data, and medical annotations to assess the correctness, and error classification and localisation of body part and timespan. Along this dataset, we perform a complete research path, from data collection to processing, and finally a small benchmark. We evaluated on the dataset two baseline movement recognition algorithms, pertaining to two different approaches: the probabilistic approach with a Gaussian Mixture Model (GMM), and the deep learning approach with a Long-Short Term Memory (LSTM). This dataset is valuable because it includes rehabilitation relevant motions in a clinical setting with patients in their rehabilitation program, using a cost-effective, portable, and convenient sensor, and because it shows the potential for improvement on these challenges.
Action in Mind: A Neural Network Approach to Action Recognition and Segmentation
Recognizing and categorizing human actions is an important task with applications in various fields such as human-robot interaction, video analysis, surveillance, video retrieval, health care system and entertainment industry. This thesis presents a novel computational approach for human action recognition through different implementations of multi-layer architectures based on artificial neural networks. Each system level development is designed to solve different aspects of the action recognition problem including online real-time processing, action segmentation and the involvement of objects. The analysis of the experimental results are illustrated and described in six articles. The proposed action recognition architecture of this thesis is composed of several processing layers including a preprocessing layer, an ordered vector representation layer and three layers of neural networks. It utilizes self-organizing neural networks such as Kohonen feature maps and growing grids as the main neural network layers. Thus the architecture presents a biological plausible approach with certain features such as topographic organization of the neurons, lateral interactions, semi-supervised learning and the ability to represent high dimensional input space in lower dimensional maps. For each level of development the system is trained with the input data consisting of consecutive 3D body postures and tested with generalized input data that the system has never met before. The experimental results of different system level developments show that the system performs well with quite high accuracy for recognizing human actions.
CAPTURE-24: A large dataset of wrist-worn activity tracker data collected in the wild for human activity recognition
Existing activity tracker datasets for human activity recognition are typically obtained by having participants perform predefined activities in an enclosed environment under supervision. This results in small datasets with a limited number of activities and heterogeneity, lacking the mixed and nuanced movements normally found in free-living scenarios. As such, models trained on laboratory-style datasets may not generalise out of sample. To address this problem, we introduce a new dataset involving wrist-worn accelerometers, wearable cameras, and sleep diaries, enabling data collection for over 24 hours in a free-living setting. The result is CAPTURE-24, a large activity tracker dataset collected in the wild from 151 participants, amounting to 3883 hours of accelerometer data, of which 2562 hours are annotated. CAPTURE-24 is two to three orders of magnitude larger than existing publicly available datasets, which is critical to developing accurate human activity recognition models.
Learning to Brachiate via Simplified Model Imitation
Brachiation is the primary form of locomotion for gibbons and siamangs, in which these primates swing from tree limb to tree limb using only their arms. It is challenging to control because of the limited control authority, the required advance planning, and the precision of the required grasps. We present a novel approach to this problem using reinforcement learning, and as demonstrated on a finger-less 14-link planar model that learns to brachiate across challenging handhold sequences. Key to our method is the use of a simplified model, a point mass with a virtual arm, for which we first learn a policy that can brachiate across handhold sequences with a prescribed order. This facilitates the learning of the policy for the full model, for which it provides guidance by providing an overall center-of-mass trajectory to imitate, as well as for the timing of the holds. Lastly, the simplified model can also readily be used for planning suitable sequences of handholds in a given environment. Our results demonstrate brachiation motions with a variety of durations for the flight and hold phases, as well as emergent extra back-and-forth swings when this proves useful. The system is evaluated with a variety of ablations. The method enables future work towards more general 3D brachiation, as well as using simplified model imitation in other settings.
Generative Action Description Prompts for Skeleton-based Action Recognition
Skeleton-based action recognition has recently received considerable attention. Current approaches to skeleton-based action recognition are typically formulated as one-hot classification tasks and do not fully exploit the semantic relations between actions. For example, "make victory sign" and "thumb up" are two actions of hand gestures, whose major difference lies in the movement of hands. This information is agnostic from the categorical one-hot encoding of action classes but could be unveiled from the action description. Therefore, utilizing action description in training could potentially benefit representation learning. In this work, we propose a Generative Action-description Prompts (GAP) approach for skeleton-based action recognition. More specifically, we employ a pre-trained large-scale language model as the knowledge engine to automatically generate text descriptions for body parts movements of actions, and propose a multi-modal training scheme by utilizing the text encoder to generate feature vectors for different body parts and supervise the skeleton encoder for action representation learning. Experiments show that our proposed GAP method achieves noticeable improvements over various baseline models without extra computation cost at inference. GAP achieves new state-of-the-arts on popular skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120 and NW-UCLA. The source code is available at https://github.com/MartinXM/GAP.
RISE Controller Tuning and System Identification Through Machine Learning for Human Lower Limb Rehabilitation via Neuromuscular Electrical Stimulation
Neuromuscular electrical stimulation (NMES) has been effectively applied in many rehabilitation treatments of individuals with spinal cord injury (SCI). In this context, we introduce a novel, robust, and intelligent control-based methodology to closed-loop NMES systems. Our approach utilizes a robust control law to guarantee system stability and machine learning tools to optimize both the controller parameters and system identification. Regarding the latter, we introduce the use of past rehabilitation data to build more realistic data-driven identified models. Furthermore, we apply the proposed methodology for the rehabilitation of lower limbs using a control technique named the robust integral of the sign of the error (RISE), an offline improved genetic algorithm optimizer, and neural network models. Although in the literature, the RISE controller presented good results on healthy subjects, without any fine-tuning method, a trial and error approach would quickly lead to muscle fatigue for individuals with SCI. In this paper, for the first time, the RISE controller is evaluated with two paraplegic subjects in one stimulation session and with seven healthy individuals in at least two and at most five sessions. The results showed that the proposed approach provided a better control performance than empirical tuning, which can avoid premature fatigue on NMES-based clinical procedures.
3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding
The increasing popularity of exercises including yoga and Pilates has created a greater demand for professional exercise video datasets in the realm of artificial intelligence. In this study, we developed 3DYoga901, which is organized within a three-level label hierarchy. We have expanded the number of poses from an existing state-of-the-art dataset, increasing it from 82 to 90 poses. Our dataset includes meticulously curated RGB yoga pose videos and 3D skeleton sequences. This dataset was created by a dedicated team of six individuals, including yoga instructors. It stands out as one of the most comprehensive open datasets, featuring the largest collection of RGB videos and 3D skeleton sequences among publicly available resources. This contribution has the potential to significantly advance the field of yoga action recognition and pose assessment. Additionally, we conducted experiments to evaluate the practicality of our proposed dataset. We employed three different model variants for benchmarking purposes.
TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting
Counting repetitive actions are widely seen in human activities such as physical exercise. Existing methods focus on performing repetitive action counting in short videos, which is tough for dealing with longer videos in more realistic scenarios. In the data-driven era, the degradation of such generalization capability is mainly attributed to the lack of long video datasets. To complement this margin, we introduce a new large-scale repetitive action counting dataset covering a wide variety of video lengths, along with more realistic situations where action interruption or action inconsistencies occur in the video. Besides, we also provide a fine-grained annotation of the action cycles instead of just counting annotation along with a numerical value. Such a dataset contains 1,451 videos with about 20,000 annotations, which is more challenging. For repetitive action counting towards more realistic scenarios, we further propose encoding multi-scale temporal correlation with transformers that can take into account both performance and efficiency. Furthermore, with the help of fine-grained annotation of action cycles, we propose a density map regression-based method to predict the action period, which yields better performance with sufficient interpretability. Our proposed method outperforms state-of-the-art methods on all datasets and also achieves better performance on the unseen dataset without fine-tuning. The dataset and code are available.
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA.
PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition
Pose-based action recognition is predominantly tackled by approaches which treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of part joint groups involving hands (e.g. `Thumbs up') or legs (e.g. `Kicking'). Although part-grouping based approaches exist, each part group is not considered within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which massively increases the number of training parameters. To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global frame based part stream approach as opposed to conventional modality based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state of the art performance on the widely used NTURGB+D 60/120 dataset and dense joint skeleton dataset NTU 60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's scalability, performance and efficiency makes it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices. Code and pretrained models can be accessed at https://github.com/skelemoa/psumnet
Forensic Activity Classification Using Digital Traces from iPhones: A Machine Learning-based Approach
Smartphones and smartwatches are ever-present in daily life, and provide a rich source of information on their users' behaviour. In particular, digital traces derived from the phone's embedded movement sensors present an opportunity for a forensic investigator to gain insight into a person's physical activities. In this work, we present a machine learning-based approach to translate digital traces into likelihood ratios (LRs) for different types of physical activities. Evaluating on a new dataset, NFI\_FARED, which contains digital traces from four different types of iPhones labelled with 19 activities, it was found that our approach could produce useful LR systems to distinguish 167 out of a possible 171 activity pairings. The same approach was extended to analyse likelihoods for multiple activities (or groups of activities) simultaneously and create activity timelines to aid in both the early and latter stages of forensic investigations. The dataset and all code required to replicate the results have also been made public to encourage further research on this topic.
Dimension Reduction for Characterizing Sexual Dimorphism in Biomechanics of the Temporomandibular Joint
Sexual dimorphism is a critical factor in many biological and medical research fields. In biomechanics and bioengineering, understanding sex differences is crucial for studying musculoskeletal conditions such as temporomandibular disorder (TMD). This paper focuses on the association between the craniofacial skeletal morphology and temporomandibular joint (TMJ) related masticatory muscle attachments to discern sex differences. Data were collected from 10 male and 11 female cadaver heads to investigate sex-specific relationships between the skull and muscles. We propose a conditional cross-covariance reduction (CCR) model, designed to examine the dynamic association between two sets of random variables conditioned on a third binary variable (e.g., sex), highlighting the most distinctive sex-related relationships between skull and muscle attachments in the human cadaver data. Under the CCR model, we employ a sparse singular value decomposition algorithm and introduce a sequential permutation for selecting sparsity (SPSS) method to select important variables and to determine the optimal number of selected variables.
Proving the Potential of Skeleton Based Action Recognition to Automate the Analysis of Manual Processes
In manufacturing sectors such as textiles and electronics, manual processes are a fundamental part of production. The analysis and monitoring of the processes is necessary for efficient production design. Traditional methods for analyzing manual processes are complex, expensive, and inflexible. Compared to established approaches such as Methods-Time-Measurement (MTM), machine learning (ML) methods promise: Higher flexibility, self-sufficient & permanent use, lower costs. In this work, based on a video stream, the current motion class in a manual assembly process is detected. With information on the current motion, Key-Performance-Indicators (KPIs) can be derived easily. A skeleton-based action recognition approach is taken, as this field recently shows major success in machine vision tasks. For skeleton-based action recognition in manual assembly, no sufficient pre-work could be found. Therefore, a ML pipeline is developed, to enable extensive research on different (pre-) processing methods and neural nets. Suitable well generalizing approaches are found, proving the potential of ML to enhance analyzation of manual processes. Models detect the current motion, performed by an operator in manual assembly, but the results can be transferred to all kinds of manual processes.
KinMo: Kinematic-aware Human Motion Understanding and Generation
Controlling human motion based on text presents an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control. Project Page: {\smallhttps://andypinxinliu.github.io/KinMo/}
Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning
Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal patient movements in their home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-sale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out of distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson's disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that in highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations. Our software is available at https://github.com/scheshmi/IMU-Video-OOD-HAR.
Initial Investigation of Kolmogorov-Arnold Networks (KANs) as Feature Extractors for IMU Based Human Activity Recognition
In this work, we explore the use of a novel neural network architecture, the Kolmogorov-Arnold Networks (KANs) as feature extractors for sensor-based (specifically IMU) Human Activity Recognition (HAR). Where conventional networks perform a parameterized weighted sum of the inputs at each node and then feed the result into a statically defined nonlinearity, KANs perform non-linear computations represented by B-SPLINES on the edges leading to each node and then just sum up the inputs at the node. Instead of learning weights, the system learns the spline parameters. In the original work, such networks have been shown to be able to more efficiently and exactly learn sophisticated real valued functions e.g. in regression or PDE solution. We hypothesize that such an ability is also advantageous for computing low-level features for IMU-based HAR. To this end, we have implemented KAN as the feature extraction architecture for IMU-based human activity recognition tasks, including four architecture variations. We present an initial performance investigation of the KAN feature extractor on four public HAR datasets. It shows that the KAN-based feature extractor outperforms CNN-based extractors on all datasets while being more parameter efficient.
Hi-OSCAR: Hierarchical Open-set Classifier for Human Activity Recognition
Within Human Activity Recognition (HAR), there is an insurmountable gap between the range of activities performed in life and those that can be captured in an annotated sensor dataset used in training. Failure to properly handle unseen activities seriously undermines any HAR classifier's reliability. Additionally within HAR, not all classes are equally dissimilar, some significantly overlap or encompass other sub-activities. Based on these observations, we arrange activity classes into a structured hierarchy. From there, we propose Hi-OSCAR: a Hierarchical Open-set Classifier for Activity Recognition, that can identify known activities at state-of-the-art accuracy while simultaneously rejecting unknown activities. This not only enables open-set classification, but also allows for unknown classes to be localized to the nearest internal node, providing insight beyond a binary "known/unknown" classification. To facilitate this and future open-set HAR research, we collected a new dataset: NFI_FARED. NFI_FARED contains data from multiple subjects performing nineteen activities from a range of contexts, including daily living, commuting, and rapid movements, which is fully public and available for download.
KAN-HAR: A Human activity recognition based on Kolmogorov-Arnold Network
Human Activity Recognition (HAR) plays a critical role in numerous applications, including healthcare monitoring, fitness tracking, and smart environments. Traditional deep learning (DL) approaches, while effective, often require extensive parameter tuning and may lack interpretability. In this work, we investigate the use of a single three-axis accelerometer and the Kolmogorov--Arnold Network (KAN) for HAR tasks, leveraging its ability to model complex nonlinear relationships with improved interpretability and parameter efficiency. The MotionSense dataset, containing smartphone-based motion sensor signals across various physical activities, is employed to evaluate the proposed approach. Our methodology involves preprocessing and normalization of accelerometer and gyroscope data, followed by KAN-based feature learning and classification. Experimental results demonstrate that the KAN achieves competitive or superior classification performance compared to conventional deep neural networks, while maintaining a significantly reduced parameter count. This highlights the potential of KAN architectures as an efficient and interpretable alternative for real-world HAR systems. The open-source implementation of the proposed framework is available at the Project's GitHub Repository.
L-SFAN: Lightweight Spatially-focused Attention Network for Pain Behavior Detection
Chronic Low Back Pain (CLBP) afflicts millions globally, significantly impacting individuals' well-being and imposing economic burdens on healthcare systems. While artificial intelligence (AI) and deep learning offer promising avenues for analyzing pain-related behaviors to improve rehabilitation strategies, current models, including convolutional neural networks (CNNs), recurrent neural networks, and graph-based neural networks, have limitations. These approaches often focus singularly on the temporal dimension or require complex architectures to exploit spatial interrelationships within multivariate time series data. To address these limitations, we introduce L-SFAN, a lightweight CNN architecture incorporating 2D filters designed to meticulously capture the spatial-temporal interplay of data from motion capture and surface electromyography sensors. Our proposed model, enhanced with an oriented global pooling layer and multi-head self-attention mechanism, prioritizes critical features to better understand CLBP and achieves competitive classification accuracy. Experimental results on the EmoPain database demonstrate that our approach not only enhances performance metrics with significantly fewer parameters but also promotes model interpretability, offering valuable insights for clinicians in managing CLBP. This advancement underscores the potential of AI in transforming healthcare practices for chronic conditions like CLBP, providing a sophisticated framework for the nuanced analysis of complex biomedical data.
Computer Vision for Clinical Gait Analysis: A Gait Abnormality Video Dataset
Clinical gait analysis (CGA) using computer vision is an emerging field in artificial intelligence that faces barriers of accessible, real-world data, and clear task objectives. This paper lays the foundation for current developments in CGA as well as vision-based methods and datasets suitable for gait analysis. We introduce The Gait Abnormality in Video Dataset (GAVD) in response to our review of over 150 current gait-related computer vision datasets, which highlighted the need for a large and accessible gait dataset clinically annotated for CGA. GAVD stands out as the largest video gait dataset, comprising 1874 sequences of normal, abnormal and pathological gaits. Additionally, GAVD includes clinically annotated RGB data sourced from publicly available content on online platforms. It also encompasses over 400 subjects who have undergone clinical grade visual screening to represent a diverse range of abnormal gait patterns, captured in various settings, including hospital clinics and urban uncontrolled outdoor environments. We demonstrate the validity of the dataset and utility of action recognition models for CGA using pretrained models Temporal Segment Networks(TSN) and SlowFast network to achieve video abnormality detection of 94% and 92% respectively when tested on GAVD dataset. A GitHub repository https://github.com/Rahmyyy/GAVD consisting of convenient URL links, and clinically relevant annotation for CGA is provided for over 450 online videos, featuring diverse subjects performing a range of normal, pathological, and abnormal gait patterns.
Pūioio: On-device Real-Time Smartphone-Based Automated Exercise Repetition Counting System
Automated exercise repetition counting has applications across the physical fitness realm, from personal health to rehabilitation. Motivated by the ubiquity of mobile phones and the benefits of tracking physical activity, this study explored the feasibility of counting exercise repetitions in real-time, using only on-device inference, on smartphones. In this work, after providing an extensive overview of the state-of-the-art automatic exercise repetition counting methods, we introduce a deep learning based exercise repetition counting system for smartphones consisting of five components: (1) Pose estimation, (2) Thresholding, (3) Optical flow, (4) State machine, and (5) Counter. The system is then implemented via a cross-platform mobile application named P\=uioio that uses only the smartphone camera to track repetitions in real time for three standard exercises: Squats, Push-ups, and Pull-ups. The proposed system was evaluated via a dataset of pre-recorded videos of individuals exercising as well as testing by subjects exercising in real time. Evaluation results indicated the system was 98.89% accurate in real-world tests and up to 98.85% when evaluated via the pre-recorded dataset. This makes it an effective, low-cost, and convenient alternative to existing solutions since the proposed system has minimal hardware requirements without requiring any wearable or specific sensors or network connectivity.
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab
The challenge of replicating research results has posed a significant impediment to the field of molecular biology. The advent of modern intelligent systems has led to notable progress in various domains. Consequently, we embarked on an investigation of intelligent monitoring systems as a means of tackling the issue of the reproducibility crisis. Specifically, we first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. This dataset comprises fine-grained hierarchical annotations intended for the purpose of studying activity understanding in BioLab. Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings. Finally, we provide a thorough experimental evaluation of contemporary video understanding models and highlight their limitations in this specialized domain to identify potential avenues for future research. We hope ProBio with associated benchmarks may garner increased focus on modern AI techniques in the realm of molecular biology.
