Abstract
Human Activity Recognition (HAR) leveraging wearable sensors has emerged as a critical research area, with broad applications spanning healthcare, elderly assistance, sports analytics, and human-computer interaction. While traditional approaches based on Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks effectively extract local spatial and sequential temporal features from multi-channel sensor data, recent advances incorporate Transformer-based architectures whose attention mechanisms capture long-range temporal dependencies without recurrence. This paper introduces a novel multivariate Transformer model designed to integrate multiple physiological and kinematic data streams, including electrocardiogram (ECG), photoplethysmogram (PPG; wrist and finger, infrared/red), galvanic skin response (GSR), respiration, body temperature, three-axis acceleration, and gyroscope signals. Distinctively, the proposed architecture assigns a dedicated encoder to each stream, using multi-head attention and learnable positional encodings to handle signal diversity, differences in sampling frequency, and latency discrepancies. Evaluated on five experimental scenarios (rest, standing, sitting, running, and walking) segmented into uniform 30-second windows, the Transformer-based model achieved approximately 99% accuracy along with near-perfect sensitivity and F1-scores, demonstrating robustness and strong generalization capability.
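To make the segmentation step concrete, the sketch below illustrates how streams sampled at different rates can be cut into uniform 30-second windows of equal duration (but differing sample counts), so that each per-stream encoder consumes its own windows. Function and parameter names, and the example sampling rates, are illustrative assumptions, not the paper's code.

```python
# Hypothetical illustration: segment a 1-D sensor stream into uniform
# 30-second windows; names and rates are assumptions, not the paper's code.

def segment_windows(signal, sampling_rate_hz, window_seconds=30):
    """Split `signal` into non-overlapping windows of `window_seconds`.

    Streams sampled at different rates (e.g. ECG at 250 Hz, accelerometer
    at 50 Hz) yield windows with different sample counts but identical
    duration, so dedicated per-stream encoders can process them in parallel.
    """
    window_len = int(sampling_rate_hz * window_seconds)
    n_windows = len(signal) // window_len  # drop the trailing partial window
    return [signal[i * window_len:(i + 1) * window_len]
            for i in range(n_windows)]

# Example: 95 seconds of a 50 Hz accelerometer axis -> three 30 s windows
# of 1500 samples each (the trailing 5 s remainder is discarded).
stream = list(range(95 * 50))
windows = segment_windows(stream, sampling_rate_hz=50)
print(len(windows), len(windows[0]))  # 3 1500
```

In practice, each window from each stream would then be embedded and passed through its dedicated encoder, with learnable positional encodings supplying temporal order within the window.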