Exploring crowd counting methodology by integrating CNN and transformer: performance optimization under weak supervision
Affiliation of Author(s):
College of Information and Control Engineering
Journal:
SIGNAL IMAGE AND VIDEO PROCESSING
Key Words:
Crowd counting
CNN · Transformer
Weakly supervised
Abstract:
Crowd counting, the task of estimating the number of individuals present in an image, is essential in surveillance and
traffic management, where it informs operational decisions and security measures. It has traditionally relied on
Convolutional Neural Networks (CNNs), which excel at local feature extraction yet fall short in capturing global
context. Conversely, Transformers excel in capturing long-range dependencies but often
overlook local intricacies. Current methodologies in crowd counting heavily depend on precise position-level annotations for
supervised training, a process demanding significant time and labor. This has spurred interest in weakly supervised training,
where models learn solely from count-level population annotations, holding immense practical and research potential. In
our study, we propose TCCNet, a novel weakly supervised network that combines CNNs and Transformers for crowd counting.
To address the CNN’s limitation in global feature extraction, we integrate a Transformer that captures extensive contextual
information and thereby improves counting accuracy. We further equip the Transformer block with Post Normalization and
Scaled Cosine Attention, which smooth activation values and improve model stability. Moreover, our crowd counting regression
block incorporates inflated convolutions, which expand the model’s receptive field while maintaining spatial resolution,
significantly benefiting crowd counting. Through extensive experimentation on five publicly available datasets and illustrative
visualizations, TCCNet showcases remarkable proficiency in accurately identifying crowd regions within images. Our findings
highlight the model’s exceptional counting performance, particularly in weakly supervised learning.
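
The Scaled Cosine Attention named in the abstract can be illustrated with a minimal sketch. The abstract does not give TCCNet's exact formulation, so the code below assumes the common definition in which attention logits are the cosine similarity between queries and keys divided by a temperature. The function names, the fixed temperature tau = 0.1, and the omission of any position bias or learnable parameters are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_cosine_attention(queries, keys, values, tau=0.1):
    """Attention whose logits are cos(q, k) / tau (tau is an assumed constant).

    Returns (outputs, weights): one output vector and one row of
    attention weights per query.
    """
    outputs, weights = [], []
    dim = len(values[0])
    for q in queries:
        logits = [cosine_similarity(q, k) / tau for k in keys]
        w = softmax(logits)
        # Weighted sum of the value vectors, one feature dimension at a time.
        out = [sum(wj * v[d] for wj, v in zip(w, values)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


queries = [[1.0, 0.0], [0.0, 2.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = scaled_cosine_attention(queries, keys, values)
```

Because cosine similarity is bounded in [-1, 1], the logits stay within [-1/tau, 1/tau] regardless of how large the query or key activations become, which is one plausible reading of the abstract's claim that this choice smooths activation values.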