Publications

Preprints

Refereed Journal Articles

Yang, W., Xiao, Q., & Zhang, Y. (2024). HAR^2bot: A human-centered augmented reality robot programming method with the awareness of cognitive load. Journal of Intelligent Manufacturing, 35(5), 1985–2003.

@article{yang2024ha,
  title = {HAR^2bot: A human-centered augmented reality robot programming method with the awareness of cognitive load},
  author = {Yang, Wenhao and Xiao, Qinqin and Zhang, Yunbo},
  journal = {Journal of Intelligent Manufacturing},
  volume = {35},
  number = {5},
  pages = {1985--2003},
  year = {2024},
  publisher = {Springer US New York},
  doi = {10.1007/s10845-023-02096-2},
  url = {https://link.springer.com/article/10.1007/s10845-023-02096-2#citeas},
  pdf = {yang2023HAR2bot.pdf}
}

In the era of Industry 4.0, manufacturing enterprises are actively adopting collaborative robots (Cobots) in their production. Current online and offline robot programming methods are difficult to use and require extensive experience or skills. On the other hand, the manufacturing industries are experiencing a labor shortage. An essential question, therefore, is: how would a new robot programming method help novice users complete complex tasks effectively, efficiently, and intuitively? To answer this question, we propose HAR^2bot, a novel human-centered augmented reality programming interface with awareness of cognitive load. Using NASA’s system design theory and the cognitive load theory, a set of guidelines for designing an AR-based human-robot interaction system is obtained through a human-centered design process. Based on these guidelines, we designed and implemented a human-in-the-loop workflow with features for cognitive load management. The effectiveness and efficiency of HAR^2bot are verified in two complex tasks compared with existing online programming methods. We also evaluated HAR^2bot quantitatively and qualitatively through a user study with 16 participants. According to the user study, compared with existing methods, HAR^2bot has higher efficiency, a lower overall cognitive load, lower cognitive loads for each type, and higher safety.

Yang, W., Dengxiong, X., Wang, X., Hu, Y., & Zhang, Y. (2024). “I can see your password”: A case study about cybersecurity risks in mid-air interactions of mixed reality-based smart manufacturing applications. Journal of Computing and Information Science in Engineering, 24(3), 031004.

@article{yang2024can,
  title = {“{I} can see your password”: A case study about cybersecurity risks in mid-air interactions of mixed reality-based smart manufacturing applications},
  author = {Yang, Wenhao and Dengxiong, Xiwen and Wang, Xueting and Hu, Yidan and Zhang, Yunbo},
  journal = {Journal of Computing and Information Science in Engineering},
  volume = {24},
  number = {3},
  pages = {031004},
  year = {2024},
  publisher = {American Society of Mechanical Engineers},
  url = {https://asmedigitalcollection.asme.org/computingengineering/article/24/3/031004/1163679/I-Can-See-Your-Password-A-Case-Study-About?guestAccessKey=},
  doi = {10.1115/1.4062658},
  pdf = {yang2024can.pdf}
}

This paper presents a potential cybersecurity risk in mixed reality (MR)-based smart manufacturing applications: digital passwords can be deciphered by using a single RGB camera to capture the user’s mid-air gestures. We first created a test bed, which is an MR-based smart factory management system consisting of mid-air gesture-based user interfaces (UIs) on a video see-through MR head-mounted display. To interact with UIs and input information, the user’s hand movements and gestures are tracked by the MR system. We set up the experiment to estimate the passwords that users input through mid-air hand gestures on a virtual numeric keypad. To achieve this goal, we developed a lightweight machine learning-based hand position tracking and gesture recognition method. This method takes either video streaming or recorded video clips (taken by a single RGB camera in front of the user) as input, where the videos record the users’ hand movements and gestures but not the virtual UIs. Assuming that the size, position, and layout of the keypad are known, the machine learning method estimates the password through hand gesture recognition and finger position detection. The evaluation result indicates the effectiveness of the proposed method, with accuracies of 97.03%, 94.06%, and 83.83% for 2-digit, 4-digit, and 6-digit passwords, respectively, using real-time video streaming as input under the known length condition. Under the unknown length condition, the proposed method reaches 85.50%, 76.15%, and 77.89% accuracy for 2-digit, 4-digit, and 6-digit passwords, respectively.
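The last step described in this abstract, turning a detected fingertip position into a key when the keypad's size, position, and layout are known, can be sketched as below. This is only an illustration under assumptions, not the paper's implementation: the phone-style `LAYOUT`, the `Keypad` class, the `decode_password` helper, and the coordinate convention (positions in the keypad plane, origin at the top-left corner, y growing downward) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Hypothetical phone-style layout; the keypad used in the paper may differ.
LAYOUT = [
    ["1", "2", "3"],
    ["4", "5", "6"],
    ["7", "8", "9"],
    ["", "0", ""],  # empty strings mark blank cells
]


@dataclass
class Keypad:
    """Known keypad geometry in the keypad plane (assumed units: metres)."""
    origin_x: float  # x of the top-left corner
    origin_y: float  # y of the top-left corner (y grows downward)
    width: float
    height: float

    def key_at(self, x: float, y: float) -> Optional[str]:
        """Return the key under (x, y), or None if outside or on a blank cell."""
        rows, cols = len(LAYOUT), len(LAYOUT[0])
        u = (x - self.origin_x) / self.width   # normalized column coordinate
        v = (y - self.origin_y) / self.height  # normalized row coordinate
        if not (0.0 <= u < 1.0 and 0.0 <= v < 1.0):
            return None
        key = LAYOUT[int(v * rows)][int(u * cols)]
        return key or None


def decode_password(keypad: Keypad, taps: Iterable[Tuple[float, float]]) -> str:
    """Concatenate the keys hit by a sequence of detected tap positions."""
    keys = (keypad.key_at(x, y) for x, y in taps)
    return "".join(k for k in keys if k is not None)


keypad = Keypad(origin_x=0.0, origin_y=0.0, width=0.3, height=0.4)
taps = [(0.05, 0.05), (0.15, 0.35), (0.50, 0.10), (0.25, 0.15)]
print(decode_password(keypad, taps))  # the third tap misses the keypad
```

In this sketch, taps that fall outside the keypad or on a blank cell are simply dropped; how the actual method handles such detections is not stated in the abstract.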

Yang, W., & Zhang, Y. (2024). A global correction framework for camera registration in video see-through augmented reality systems. Journal of Computing and Information Science in Engineering, 24(3), 031003.

@article{yang2024global,
  title = {A global correction framework for camera registration in video see-through augmented reality systems},
  author = {Yang, Wenhao and Zhang, Yunbo},
  journal = {Journal of Computing and Information Science in Engineering},
  volume = {24},
  number = {3},
  pages = {031003},
  year = {2024},
  publisher = {American Society of Mechanical Engineers},
  url = {https://asmedigitalcollection.asme.org/computingengineering/article/24/3/031003/1166670},
  doi = {10.1115/1.4063350},
  pdf = {yang2024global.pdf}
}

Augmented reality (AR) enhances the user’s perception of the real environment by superimposing computer-generated virtual images. These virtual images provide additional visual information that complements the real-world view. AR systems are rapidly gaining popularity in various manufacturing fields such as training, maintenance, assembly, and robot programming. In some AR applications, it is crucial for the invisible virtual environment to be precisely aligned with the physical environment to ensure that human users can accurately perceive the virtual augmentation in conjunction with their real surroundings. The process of achieving this accurate alignment is known as calibration. During some robotics applications using AR, we observed instances of misalignment in the visual representation within the designated workspace. This misalignment can potentially impact the accuracy of the robot’s operations during the task. Building on previous research on AR-assisted robot programming systems, this work investigates the sources of misalignment errors and presents a simple and efficient calibration procedure to reduce misalignment in general video see-through AR systems. To accurately superimpose virtual information onto the real environment, it is necessary to identify the sources and propagation of errors. In this work, we outline the linear transformation and projection of each point from the virtual world space to the virtual screen coordinates. An offline calibration method is introduced to determine the offset matrix from the head-mounted display (HMD) to the camera, and experiments are conducted to validate the improvement achieved through the calibration process.
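The transformation chain mentioned in this abstract can be illustrated with a small sketch: a world-space point is moved into the HMD frame, then into the camera frame through the HMD-to-camera offset matrix, and finally projected to pixel coordinates. The pinhole model, the intrinsics (`fx`, `fy`, `cx`, `cy`), the matrices, and all function names are assumptions chosen for illustration, not values or code from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def mat_vec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply a 4x4 matrix by a homogeneous 4-vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def translation(tx: float, ty: float, tz: float) -> Matrix:
    """Homogeneous transform that only translates."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def project_point(
    point_world: Sequence[float],
    world_to_hmd: Matrix,
    hmd_to_camera: Matrix,  # the offset matrix that calibration would estimate
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float]:
    """Project a world-space point to pixels (camera assumed to look along +z)."""
    p_world = [*point_world, 1.0]
    p_hmd = mat_vec(world_to_hmd, p_world)
    x, y, z, _ = mat_vec(hmd_to_camera, p_hmd)
    if z <= 0.0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy


identity = translation(0.0, 0.0, 0.0)
intrinsics = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
point = (0.1, 0.0, 2.0)  # 2 m in front of the camera

u_ref, v_ref = project_point(point, identity, identity, **intrinsics)
# Same point, but with a 1 cm error along x in the HMD-to-camera offset.
u_err, v_err = project_point(point, identity, translation(0.01, 0.0, 0.0), **intrinsics)
print(u_err - u_ref, v_err - v_ref)
```

With these assumed numbers, a 1 cm error in the offset shifts the rendered point by about 5 pixels at a depth of 2 m (`fx * 0.01 / 2`), which shows why a small offset error becomes visible as misalignment.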

Xian, C., Zhang, J., Yang, W., & Zhang, Y. (2024). Multi-scale progressive fusion-based depth image completion and enhancement for industrial collaborative robot applications. Journal of Intelligent Manufacturing, 35(5), 2119–2135.

@article{xian2024multi,
  title = {Multi-scale progressive fusion-based depth image completion and enhancement for industrial collaborative robot applications},
  author = {Xian, Chuhua and Zhang, Jun and Yang, Wenhao and Zhang, Yunbo},
  journal = {Journal of Intelligent Manufacturing},
  volume = {35},
  number = {5},
  pages = {2119--2135},
  year = {2024},
  publisher = {Springer US New York},
  pdf = {xian2024multi.pdf},
  url = {https://link.springer.com/article/10.1007/s10845-023-02299-7},
  doi = {10.1007/s10845-023-02299-7}
}

The depth image obtained by consumer-level depth cameras generally has low resolution and missing regions due to the limitations of the depth camera hardware and the method of depth image generation. Despite the fact that many studies have been done on RGB image completion and super-resolution, a key issue with depth images is that there will be evident jagged boundaries and a significant loss of geometric information. To address these issues, we introduce a multi-scale progressive fusion network for depth image completion and super-resolution in this paper, which has an asymptotic structure for integrating hierarchical features in different domains. We employ two separate branches to learn the features of a multi-scale image given a depth image and its corresponding RGB image. The extracted features are then fused into different level features of these two branches using a step-by-step strategy to recreate the final depth image. To confine distinct borders and geometric features, a multi-dimension loss is also designed. Extensive depth completion and super-resolution studies reveal that our proposed method outperforms state-of-the-art methods both qualitatively and quantitatively. The proposed methods are also applied to two human–robot interaction applications, including a remote-controlled robot based on an unmanned ground vehicle (UGV), AR-based toolpath planning, and automatic toolpath extraction. All these experimental results indicate the effectiveness and potential benefits of the proposed methods.

Refereed Conference Proceedings

Liu, Y., Liang, J., Fan, H., Yang, W., Cui, Y., Han, X., Huang, L., Liu, D., Wang, Q., & Han, C. (2026). All you need is one: Capsule prompt tuning with a single vector. Advances in Neural Information Processing Systems, 38, 88139–88166.

@inproceedings{liu2026all,
  title = {All you need is one: Capsule prompt tuning with a single vector},
  author = {Liu, Yiyang and Liang, James and Fan, Heng and Yang, Wenhao and Cui, Yiming and Han, Xiaotian and Huang, Lifu and Liu, Dongfang and Wang, Qifan and Han, Cheng},
  booktitle = {Advances in Neural Information Processing Systems},
  volume = {38},
  pages = {88139--88166},
  year = {2026},
  url = {https://proceedings.neurips.cc/paper_files/paper/2025/hash/7f8b8bc8ebac661c442c4dafd5d98c08-Abstract-Conference.html},
  pdf = {NeurIPS-2025-all-you-need-is-one-capsule-prompt-tuning-with-a-single-vector-Paper-Conference.pdf}
}

Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods rely heavily on laborious grid searching for the optimal prompt length and typically require a considerable number of prompts, introducing additional computational burden. Worse yet, our pioneering findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely "attention anchor": incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03% average accuracy on T5-Large), serving as an "attention anchor," while enjoying high parameter efficiency (e.g., 0.003% of model parameters on Llama3.2-1B).

Wang, T., Han, C., Liang, J., Yang, W., Liu, D., Zhang, L. X., Wang, Q., Luo, J., & Tang, R. (2025). Exploring the adversarial vulnerabilities of vision-language-action models in robotics. Proceedings of the IEEE/CVF International Conference on Computer Vision, 6948–6958. arXiv:2411.13587.

@inproceedings{wang2025exploring,
  title = {Exploring the adversarial vulnerabilities of vision-language-action models in robotics},
  author = {Wang, Taowen and Han, Cheng and Liang, James and Yang, Wenhao and Liu, Dongfang and Zhang, Luna Xinyu and Wang, Qifan and Luo, Jiebo and Tang, Ruixiang},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages = {6948--6958},
  year = {2025},
  note = {arXiv:2411.13587},
  doi = {10.48550/arXiv.2411.13587},
  url = {https://openaccess.thecvf.com/content/ICCV2025/html/Wang_Exploring_the_Adversarial_Vulnerabilities_of_Vision-Language-Action_Models_in_Robotics_ICCV_2025_paper.html},
  arxiv = {https://arxiv.org/abs/2411.13587},
  project = {/research/#exploring-the-adversarial-vulnerabilities-of-vision-language-action-models-in-robotics},
  pdf = {Wang_Exploring_the_Adversarial_Vulnerabilities_of_Vision-Language-Action_Models_in_Robotics_ICCV_2025_paper.pdf}
}

Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. Despite their significant capabilities, VLA models introduce new attack surfaces. This paper systematically evaluates their robustness. Recognizing the unique demands of robotic execution, our attack objectives target the inherent spatial and functional characteristics of robotic systems. In particular, we introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. Additionally, we design an adversarial patch generation approach that places a small, colorful patch within the camera’s view, effectively executing the attack in both digital and physical environments. Our evaluation reveals a marked degradation in task success rates, with up to a 100% reduction across a suite of simulated robotic tasks, highlighting critical security gaps in current VLA architectures. By unveiling these vulnerabilities and proposing actionable evaluation metrics, we advance both the understanding and enhancement of safety for VLA-based robotic systems, underscoring the necessity for continuously developing robust defense strategies prior to physical-world deployments.

Yang, W., Bai, S., & Zhang, Y. (2024). RADAR: Robotics Assembly by Demonstration via Augmented Reality. 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 7063–7070.

@inproceedings{yang2024radar,
  title = {RADAR: Robotics Assembly by Demonstration via Augmented Reality},
  author = {Yang, Wenhao and Bai, Shi and Zhang, Yunbo},
  booktitle = {2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  pages = {7063--7070},
  year = {2024},
  organization = {IEEE},
  url = {https://ieeexplore.ieee.org/document/10801493},
  doi = {10.1109/IROS58592.2024.10801493}
}

With the widespread adoption of robots in high-mix, low-volume manufacturing, and the challenges posed by long-horizon assembly tasks, we introduce the RADAR system, an integrated human-robot collaboration system for Robotic Assembly by Demonstration via Augmented Reality. Existing approaches lack a comprehensive, cross-task framework for effective assembly collaboration, limiting their applicability in complex tasks. We designed the RADAR system’s conceptual model, detailing its workflow and components. The system integrates human input into robotic metal beam assembly through augmented reality interactions and interfaces. We also developed a task planner that dynamically adjusts human-robot assembly tasks at coarse-fine resolutions. Validation through practical scenarios, particularly the RAMP assembly benchmark, showed that human involvement significantly enhances assembly precision and success rates, demonstrating RADAR’s effectiveness and efficiency in human-robot collaborative assembly.

Yang, W., & Zhang, Y. (2022). Visualization error analysis for augmented reality stereo video see-through head-mounted displays in industry 4.0 applications. International Manufacturing Science and Engineering Conference, 85819, V002T06A016.

@inproceedings{yang2022visualization,
  title = {Visualization error analysis for augmented reality stereo video see-through head-mounted displays in industry 4.0 applications},
  author = {Yang, Wenhao and Zhang, Yunbo},
  booktitle = {International Manufacturing Science and Engineering Conference},
  volume = {85819},
  pages = {V002T06A016},
  year = {2022},
  organization = {American Society of Mechanical Engineers},
  url = {https://asmedigitalcollection.asme.org/MSEC/proceedings-abstract/MSEC2022/V002T06A016/1147062},
  doi = {10.1115/MSEC2022-85440},
  pdf = {yang2022visualization.pdf}
}

Under the fourth industrial revolution (Industry 4.0), Augmented Reality (AR) provides new affordances for a variety of applications, such as AR-based human-robot interaction, virtual assembly assistance, and virtual workforce training. See-through head-mounted displays (STHMDs), based on either optical see-through or video see-through technologies, are the primary AR devices for augmenting the visual perception of the real environment with computer-generated content through a hands-free headset. Specifically, video see-through STHMDs superimpose the real environment and virtual content in digital images and output the result to users, while optical see-through STHMDs display virtual content through an optics-based near-eye display while preserving users’ normal view of the real scene. For both types of AR devices, the accuracy of visualization is essential. For example, in AR-based human-robot interaction, inaccurate rendering of 3D virtual objects with respect to the real environment leads to user operation mistakes and therefore to an invalid tool path planning result. Despite many works on system calibration and error reduction for optical see-through STHMDs, few efforts have examined the nature and factors of those errors in video see-through STHMDs. In this paper, taking consumer-available AR video see-through STHMDs as an example, we identify the sources of registration error and build a mathematical model of the display process to describe error propagation in stereo video see-through systems. Then, based on this model, the sensitivity of the final registration error to each error source is analyzed. Finally, possible error correction solutions for general video see-through STHMDs are suggested and summarized.

Yang, W., Xiao, Q., & Zhang, Y. (2021). An augmented-reality based human-robot interface for robotics programming in the complex environment. International Manufacturing Science and Engineering Conference, 85079, V002T07A003.

@inproceedings{yang2021augmented,
  title = {An augmented-reality based human-robot interface for robotics programming in the complex environment},
  author = {Yang, Wenhao and Xiao, Qinqin and Zhang, Yunbo},
  booktitle = {International Manufacturing Science and Engineering Conference},
  volume = {85079},
  pages = {V002T07A003},
  year = {2021},
  organization = {American Society of Mechanical Engineers},
  url = {https://asmedigitalcollection.asme.org/MSEC/proceedings-abstract/MSEC2021/85079/1115433},
  doi = {10.1115/MSEC2021-62468}
}

To solve the problems of complex robot programming tasks, we propose an Augmented Reality (AR) based human-robot interface for planning a collision-free path in a complex environment. Current robot programming methods usually require a high level of experience in robot programming (online programming), time-consuming 3D modeling of the working environment for collision detection (offline programming), and tedious and inefficient re-planning to adapt to environment or task changes (both online and offline programming). To address these problems, an end-to-end AR human-robot interface is proposed, which provides a new affordance to users by enabling them to plan the path in the AR environment. A set of user-interactive tools allows users to define and edit waypoints as the high-level guidance and the direct inputs for the toolpath planning package, Kinematics and Dynamics Library (KDL). With fast sensing of the workspace and accurate rendering, an in-situ simulation module is utilized for collision checking and verification through the users’ perception. Users repeat the process of 1) waypoint definition and editing, and 2) collision checking and path feasibility verification, until a satisfactory path is obtained. A preliminary test is conducted in a use case with complex obstacles to verify the effectiveness and efficiency of the proposed interface.
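The repeat-until-satisfactory workflow in this abstract can be sketched as a loop that alternates waypoint editing with collision checking. This is a simplified illustration under stated assumptions: obstacles are modeled as spheres, the path is a straight-line polyline through the waypoints, and `segment_hits_sphere`, `first_collision`, `plan_path`, and the `edit` callback (standing in for the user's interactive edit) are hypothetical names. The system described in the abstract instead uses sensed workspace data, KDL-based toolpath planning, and the user's own visual verification.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Sphere = Tuple[Point, float]  # (center, radius), an assumed obstacle model


def segment_hits_sphere(a: Point, b: Point, sphere: Sphere) -> bool:
    """True if the segment a-b passes strictly inside the sphere."""
    center, radius = sphere
    ab = [q - p for p, q in zip(a, b)]
    ac = [c - p for p, c in zip(a, center)]
    length2 = sum(d * d for d in ab)
    # Parameter of the point on the segment closest to the center, clamped to [0, 1].
    t = 0.0 if length2 == 0.0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ab, ac)) / length2))
    closest = [p + t * d for p, d in zip(a, ab)]
    return math.dist(closest, center) < radius


def first_collision(path: Sequence[Point], obstacles: Sequence[Sphere]) -> Optional[int]:
    """Index of the first colliding segment, or None if the path is free."""
    for i in range(len(path) - 1):
        if any(segment_hits_sphere(path[i], path[i + 1], s) for s in obstacles):
            return i
    return None


def plan_path(
    waypoints: Sequence[Point],
    obstacles: Sequence[Sphere],
    edit: Callable[[List[Point], int], List[Point]],
    max_rounds: int = 10,
) -> List[Point]:
    """Alternate collision checking and waypoint editing until the path is free."""
    path = list(waypoints)
    for _ in range(max_rounds):
        hit = first_collision(path, obstacles)
        if hit is None:
            return path
        path = edit(path, hit)  # stand-in for the user's edit in AR
    raise RuntimeError("no collision-free path found within max_rounds")


def lift_over(path: List[Point], i: int) -> List[Point]:
    """Example edit: insert a raised waypoint in the middle of segment i."""
    (ax, ay, az), (bx, by, bz) = path[i], path[i + 1]
    mid = ((ax + bx) / 2, (ay + by) / 2, max(az, bz) + 1.0)
    return path[: i + 1] + [mid] + path[i + 1 :]


obstacles = [((1.0, 0.0, 0.0), 0.5)]
print(plan_path([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], obstacles, lift_over))
```

Passing the edit as a callback keeps the loop itself independent of how waypoints are changed, which mirrors the abstract's separation between user-driven waypoint editing and automated collision checking.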