Searching for MobileNetV3
Andrew G. Howard, M. Sandler, Grace Chu
et al.
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 20% compared to MobileNetV2. MobileNetV3-Small is 6.6% more accurate compared to a MobileNetV2 model with comparable latency. MobileNetV3-Large detection is over 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 34% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
9011 citations
en
Computer Science
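The efficiency gains claimed above trace back to the depthwise separable convolutions underlying the whole MobileNet family. A minimal sketch of the multiply-add comparison (layer sizes below are illustrative assumptions, not taken from the paper):

```python
# Multiply-add cost of a standard convolution vs. a depthwise separable
# convolution (the building block of the MobileNet family). The layer
# sizes used in the demo are assumed for illustration only.

def standard_conv_madds(k, c_in, c_out, h, w):
    # every output pixel mixes all input channels through a k x k kernel
    return k * k * c_in * c_out * h * w

def depthwise_separable_madds(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution mixes channels
    return depthwise + pointwise

k, c_in, c_out, h, w = 3, 64, 128, 56, 56
full = standard_conv_madds(k, c_in, c_out, h, w)
sep = depthwise_separable_madds(k, c_in, c_out, h, w)
print(f"standard: {full:,}  separable: {sep:,}  ratio: {full / sep:.1f}x")
```

For a 3x3 kernel the saving approaches an order of magnitude, which is why the family targets mobile CPUs.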
YOLO9000: Better, Faster, Stronger
J. Redmon, Ali Farhadi
We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. Using a novel, multi-scale training method the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. YOLO9000 predicts detections for more than 9000 different object categories, all in real-time.
17410 citations
en
Computer Science
XGBoost: A Scalable Tree Boosting System
Tianqi Chen, Carlos Guestrin
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
51874 citations
en
Computer Science
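The "insights" referenced above rest on XGBoost's second-order objective. A minimal pure-Python sketch of the paper's leaf-weight and split-gain formulas (squared-error loss assumed, so gradients are residuals and hessians are 1; this illustrates the objective, not the library's API):

```python
# XGBoost-style second-order statistics: for leaf gradients G = sum(g_i)
# and hessians H = sum(h_i), the optimal leaf weight is w* = -G/(H + lam)
# and a split's gain is
#   0.5 * (G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - G^2/(H+lam)) - gamma.
# Squared-error loss is assumed here for simplicity.

def leaf_weight(grads, hess, lam=1.0):
    return -sum(grads) / (sum(hess) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    def score(g, h):
        G, H = sum(g), sum(h)
        return G * G / (H + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# toy split: residuals [-1, -1] go left, [-5, -5] go right
print(split_gain([-1, -1], [1, 1], [-5, -5], [1, 1]))
```

Every candidate split in the tree-growing loop is scored this way, which is what the paper's sparsity-aware and quantile-sketch machinery accelerates.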
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross B. Girshick
et al.
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
71643 citations
en
Computer Science, Medicine
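The RPN's predictions "at each position" are made relative to a grid of reference anchor boxes. A sketch of anchor generation (the stride, scales, and aspect ratios below are common assumptions, not necessarily the paper's exact configuration):

```python
import numpy as np

# RPN-style anchor generation sketch: at every feature-map position, place
# one reference box per (scale, ratio) pair; the network then scores
# objectness and regresses bounds for each anchor. The stride/scale/ratio
# values here are illustrative assumptions.

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx = x * stride + stride / 2   # anchor center in image pixels
            cy = y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w, h = s * r ** 0.5, s / r ** 0.5  # keep area ~= s^2
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)  # (feat_h * feat_w * 9, 4) boxes, x1 y1 x2 y2
```

With 3 scales and 3 ratios this yields 9 anchors per position, and the detector keeps only the top-scoring proposals (300 per image in the abstract's numbers).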
Rethinking the Inception Architecture for Computer Vision
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe
et al.
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim to utilize the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error on the validation set and 3.6% top-5 error on the official test set.
30601 citations
en
Computer Science
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, M. Maire, Serge J. Belongie
et al.
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
51714 citations
en
Computer Science
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su
et al.
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.
42265 citations
en
Computer Science
Convolutional Neural Networks for Sentence Classification
Yoon Kim
We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
14114 citations
en
Computer Science
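The "simple CNN" above convolves filters over the word-vector sequence and, in the paper's design, keeps one maximum response per filter (max-over-time pooling), so sentences of any length map to a fixed-size feature vector. A toy sketch with assumed dimensions:

```python
import numpy as np

# Max-over-time pooling sketch in the style of Kim's sentence CNN: a
# filter of width w slides over a (sentence_len, dim) matrix of word
# vectors and only the largest response survives. Dimensions are toy
# assumptions for illustration.

def conv_max_over_time(embeddings, filt, bias=0.0):
    # embeddings: (sentence_len, dim); filt: (width, dim)
    n, w = len(embeddings), len(filt)
    responses = [float(np.sum(embeddings[i:i + w] * filt) + bias)
                 for i in range(n - w + 1)]
    return max(responses)  # one scalar feature per filter

sentence = np.arange(12, dtype=float).reshape(4, 3)  # 4 words, 3-d vectors
print(conv_max_over_time(sentence, np.ones((2, 3))))
```

A real model applies many such filters of several widths and feeds the concatenated max responses to a softmax classifier.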
Very Deep Convolutional Networks for Large-Scale Image Recognition
K. Simonyan, Andrew Zisserman
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
110417 citations
en
Computer Science
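The depth argument above rests on stacking very small filters: two 3x3 layers cover the same 5x5 receptive field as a single 5x5 layer with fewer parameters. A quick arithmetic sketch (C input and output channels assumed, biases ignored):

```python
# Parameter count of one 5x5 conv layer vs. two stacked 3x3 layers with
# the same receptive field, the design argument behind VGG's very small
# filters. C channels in and out are assumed; biases are ignored.

def conv_params(k, c):
    return k * k * c * c          # k x k kernel, c -> c channels

def stacked_3x3_params(c, depth=2):
    return depth * conv_params(3, c)

c = 256
# 25*C^2 vs. 18*C^2: the stack is ~28% cheaper and adds a nonlinearity
print(conv_params(5, c), stacked_3x3_params(c))
```

The same reasoning extends to three 3x3 layers standing in for a 7x7 filter.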
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia
et al.
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer-deep network, the quality of which is assessed in the context of classification and detection.
46909 citations
en
Computer Science
Fully convolutional networks for semantic segmentation
Evan Shelhamer, Jonathan Long, Trevor Darrell
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
41479 citations
en
Computer Science
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, G. Corrado
et al.
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
34030 citations
en
Computer Science
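The word-similarity evaluation mentioned above is conventionally scored with cosine similarity between embedding vectors. A sketch on made-up toy vectors (the 3-d values are pure illustration, not trained embeddings):

```python
import numpy as np

# Word similarity on word2vec-style vectors: cosine of the angle between
# two embeddings. The toy 3-d vectors below are invented assumptions,
# used only to show the computation.

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
apple = [0.1, 0.9, 0.1]
# related words should score higher than unrelated ones
print(cosine(king, queen), cosine(king, apple))
```

The paper's syntactic and semantic analogy tests use the same geometry, ranking candidate words by cosine similarity to a composed query vector.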
Optimizing Hyperparameters of Neural-Based Image Compressors
Lucas S. Lopes, Ricardo L. de Queiroz
The performance of neural image coders is heavily dependent on their architecture and, hence, on the selection of hyperparameters. Such performance, for a given architecture, is often ascertained by trial, that is, after training and inference, so that many trials may be conducted to select the hyperparameters. We propose a multi-objective hyperparameter optimization (MOHPO) method for neural image compression based on rate-distortion-complexity (RDC) analysis, which drastically reduces the number of networks to try (train and test), thereby saving resources. We validate it on well-established benchmark problems and demonstrate its use with popular autoencoders, measuring their complexities in terms of the number of parameters and floating-point operations. Our method, which we refer to as the greedy lower convex hull (GLCH), aims to track the lower convex hull of a cloud of hyperparameter possibilities. We compare our method with other well-established state-of-the-art MOHPO methods in terms of log-hypervolume difference as a function of the number of trained networks. The results indicate that the proposed method is highly competitive, particularly with fewer trained networks, which is a critical scenario in practice. Furthermore, it is deterministic, that is, it remains consistent across different runs.
Electrical engineering. Electronics. Nuclear engineering
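The hull that the GLCH method tracks can be pictured in two dimensions. The routine below is a standard lower-convex-hull computation over (rate, distortion) points, a sketch of the underlying geometry only, not the authors' greedy algorithm, which works incrementally over hyperparameter choices in full rate-distortion-complexity space:

```python
# Standard monotone-chain lower convex hull over 2-D (rate, distortion)
# points. Points above the hull are dominated trade-offs; the GLCH method
# of the paper tracks such a hull greedily without training every network.

def lower_convex_hull(points):
    pts = sorted(points)                 # sort by rate, then distortion
    hull = []
    for p in pts:
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            cross = ((a[0] - o[0]) * (p[1] - o[1])
                     - (a[1] - o[1]) * (p[0] - o[0]))
            if cross <= 0:               # a lies on or above chord o->p
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

print(lower_convex_hull([(0, 3), (1, 1), (2, 2), (3, 0)]))
```

Here the interior point (2, 2) is dominated and dropped, leaving the Pareto-efficient trade-off curve.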
Design of public space guide system based on augmented reality technology
Pu Jiao, Limin Ran
With the rapid development of science and technology, augmented reality technology provides intelligent application services. The research is based on imaging techniques using augmented reality technology and camera image capture. It then uses screen error algorithms and scale-invariant feature transformation operators to test the quality of scene spatial models. The experimental results demonstrated that the camera significantly improved the frame rate of scene model rendering and could steadily enhance rendering efficiency. Regarding image quality and its influencing factors, the binary robust invariant scalable keypoints and scale-invariant feature transformation algorithms achieved the highest recall of 92% under viewpoint changes. The map drawing module, Hessian matrix, and scale-invariant feature transformation algorithm achieved the highest recall rate of 98% in the image blurring test. This demonstrates the advantage of using a scale-invariant feature transformation operator to capture scene space influence and provide a more accurate spatial model reference for augmented reality technology, enhancing the functional design of the guide system.
Computational linguistics. Natural language processing, Electronic computers. Computer science
InVDriver: Intra-instance aware vectorized query-based autonomous driving transformer
Bo Zhang, Heye Huang, Chunyang Liu
et al.
End-to-end autonomous driving, with its holistic optimization capabilities, has gained increasing traction in academia and industry. Vectorized representations, which preserve instance-level topological information while reducing computational overhead, have emerged as promising paradigms. However, existing vectorized query-based frameworks often overlook the inherent spatial correlations among intra-instance points, resulting in geometrically inconsistent outputs (e.g., fragmented HD map elements or oscillatory trajectories). To address these limitations, we propose intra-instance vectorized driving transformer (InVDriver), a novel vectorized query-based system that systematically models intra-instance spatial dependencies through masked self-attention layers, thereby enhancing planning accuracy and trajectory smoothness. Across all core modules, i.e., perception, prediction, and planning, InVDriver incorporates masked self-attention mechanisms that restrict attention to intra-instance point interactions, enabling coordinated refinement of structural elements while suppressing irrelevant inter-instance noise. The experimental results on the nuScenes benchmark demonstrate that InVDriver achieves state-of-the-art performance, surpassing prior methods in both accuracy and safety, while maintaining high computational efficiency.
Motor vehicles. Aeronautics. Astronautics
LightNet: a lightweight head pose estimation model for online education and its application to engagement assessment
Lin Zheng, Jinlong Li, Zhanbo Zhu
et al.
In recent years, with the popularization of online education, real-time monitoring of learning engagement has become a key challenge for scholars. Existing studies mainly rely on questionnaires and physiological signal detection, which have limitations such as high subjectivity, poor real-time performance, and expensive equipment. Previous research has shown that head pose is closely related to cognitive state. However, current estimation models require substantial computational resources, making real-time deployment on mobile devices challenging. In this study, we validate the significant correlation between head pose and learning engagement based on the DAiSEE dataset (8,925 video clips) and propose a lightweight head pose estimation method. The LightNet proposed in this paper uses an improved feature extraction module (MG-Net) and an attention-based multi-scale fusion model (AMF). Experiments conducted on the 300W-LP and BIWI benchmark datasets demonstrate that, compared with existing state-of-the-art methods, LightNet substantially reduces model complexity by decreasing the number of parameters to just 0.45 × 10^6, representing over 90% reduction in model size. Despite this significant compression, LightNet maintains a high level of accuracy, with the mean absolute error (MAE) increasing by only 0.15°, indicating a minimal loss in prediction precision. Moreover, the model achieves a notable improvement in processing speed, exceeding a 50% increase relative to baseline approaches. This combination of a lightweight architecture, competitive accuracy, and accelerated inference speed underscores LightNet's effectiveness and its potential suitability for real-time applications. This study not only expands the application of head pose in education but also provides a feasible solution for real-time engagement monitoring on resource-constrained devices.
Electronic computers. Computer science
Why Open Small AI Models Matter for Interactive Art
Mar Canet Sola, Varvara Guljajeva
This position paper argues for the importance of open small AI models in creative independence for interactive art practices. Deployable locally, these models offer artists vital control over infrastructure and code, unlike dominant large, closed-source corporate systems. Such centralized platforms function as opaque black boxes, imposing severe limitations on interactive artworks, including restrictive content filters, preservation issues, and technical challenges such as increased latency and limited interfaces. In contrast, small AI models empower creators with more autonomy, control, and sustainability for these artistic processes. They let artists use a model for as long as they want and create their own custom models, either by making code changes to integrate new interfaces or by re-training or fine-tuning on new datasets. This fosters technological self-determination, offering greater ownership and reducing reliance on corporate AI ill-suited for interactive art's demands. Critically, this approach empowers the artist and supports long-term preservation and exhibition of artworks with AI components. This paper explores the practical applications and implications of using open small AI models in interactive art, contrasting them with closed-source alternatives.
NP-membership for the boundary-boundary art-gallery problem
Jack Stade
The boundary-boundary art-gallery problem asks, given a polygon $P$ representing an art-gallery, for a minimal set of guards that can see the entire boundary of $P$ (the wall of the art gallery), where the guards must be placed on the boundary. That is, for each point on the boundary, there should be a line segment connecting it to one of the guards that is contained in $P$. We show that this art-gallery variant is in NP, even if the polygon can have holes. In order to prove this, we develop a constraint-propagation procedure for continuous constraint satisfaction problems where each constraint involves at most 2 variables. The X-Y variant of the art-gallery problem is the one where the guards must lie in X and need to see all of Y. Each of X and Y can be either the vertices of the polygon, the boundary of the polygon, or the entire polygon, giving 9 different variants. Previously, it was known that X-vertex and vertex-Y variants are all NP-complete and that the point-point, point-boundary, and boundary-point variants are $\exists \mathbb{R}$-complete [Abrahamsen, Adamaszek, and Miltzow, JACM 2021][Stade, SoCG 2025]. However, the boundary-boundary variant was only known to lie somewhere between NP and $\exists \mathbb{R}$. The X-vertex and vertex-Y variants can be straightforwardly reduced to discrete set-cover instances. In contrast, we give an example to show that a solution to an instance of the boundary-boundary art-gallery problem sometimes requires placing guards at irrational coordinates, so it is unlikely that the problem can be easily discretized.
Adaptation of welding transformer parameters taking into account arc output characteristics
Savchuk V. S., Plekhov A. S.
Welding of metals, widely used in modern technology, is a complex technological process. To assess the quality of a welded joint and its compliance with operational requirements, mathematical modeling methods are used. A mathematical model of a pulse transformer has been developed using the Matlab Simulink software package and software modules from Schneider Electric. The output data have been verified at the manufacturing company NPK Etalon (Rostov Region, Volgodonsk), where the obtained data were compared with the theoretical base. The mathematical model of the pulse transformer allows implementing STT (Surface Tension Transfer: heat and mass transfer driven by surface-tension forces) welding with a deeper calibration of current power pulses with clearly defined electrical parameters. It also solves the problem of switching welding modes in complex operating scenarios and is applicable to various types of welding (contact, arc, beam, etc.).
A Seamless Technology Integration Framework for Elderly-Centered Interactive Systems: Design, Implementation, and Validation Through the Pillow Fight System
Chor-Kheng Lim, Hung-Yu Chen, Xuan-Yu Chen
The rapid aging of the global population has highlighted the urgent need for age-friendly technological solutions. However, our review of the existing studies revealed significant gaps between elderly users' needs and the current smart technology products. While previous research has explored various aspects of elderly technology design, our systematic analysis indicated limitations, in terms of integrated frameworks, empirically validated interaction models, and long-term effectiveness evaluation. This study proposes a seamless technology framework for the development of elderly-centered interactive systems—named Pillow Fight—which we validated through a mixed-methods study of an innovative intergenerational gaming system. This research was conducted in two phases across 34 sites, with 1997 participants. Phase one (n = 659) established the framework's effectiveness through systematic field testing, while phase two (n = 1338) validated its scalability and long-term benefits. The results highlight significant improvements in system usability scores (SUS), from individual use (77.85) to group activities (85.32), with intergenerational interaction achieving the highest scores (92.12). The integration of health monitoring features further enhanced learnability (94.12) and usability (82.15). This study contributes to the design of technology for the elderly through (1) establishing an integrated theoretical framework for seamless technology integration, (2) developing and validating innovative intergenerational interaction models, and (3) providing empirical evidence through systematic field studies. These contributions could advance both theoretical understanding and practical applications while maintaining high user-friendliness.
Technology, Engineering (General). Civil engineering (General)
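The SUS figures reported above follow the standard System Usability Scale scoring: ten items rated 1-5, with odd items contributing (x - 1), even items (5 - x), and the sum scaled by 2.5 onto a 0-100 range. A sketch (the sample responses are made up for illustration):

```python
# Standard System Usability Scale (SUS) scoring, the metric behind the
# usability figures reported above. Ten 1-5 responses: odd-numbered items
# contribute (x - 1), even-numbered items (5 - x), total scaled by 2.5.
# The example responses are invented for illustration.

def sus_score(responses):
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum(r - 1 if i % 2 == 0 else 5 - r   # i = 0 is item 1 (odd)
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # best possible: 100.0
```

On this scale, the study's group-activity mean of 85.32 sits well above the commonly cited 68-point average benchmark.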