This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
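The non-overlapping window partitioning and the half-window cyclic shift at the core of the scheme can be sketched in a few lines of NumPy. This is an illustrative sketch under assumed shapes (an `(H, W, C)` feature map, window size 4), not the released implementation:

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    # -> (num_windows, win, win, C), windows ordered row-major
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

def shift(x, win):
    """Cyclically shift the map by half a window so the next block's
    windows straddle the previous block's window boundaries."""
    return np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))

x = np.arange(8 * 8 * 1).reshape(8, 8, 1).astype(float)
wins = window_partition(x, 4)                    # 4 windows of 4x4
shifted_wins = window_partition(shift(x, 4), 4)  # cross-window connections
```

Self-attention is then computed independently inside each window, which is what keeps the cost linear in image size.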
We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
F. Iandola, Matthew W. Moskewicz, Khalid Ashraf, et al.
Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: this https URL
This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories, such as faces, album covers, and room interiors. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions.
In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016). We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to swapping batch normalization with instance normalization, and applying the latter at both training and testing time. The resulting method can be used to train high-performance architectures for real-time image generation. The code is made available on github at this https URL. The full paper can be found at arXiv:1701.02096.
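The single change described here, normalizing each image's own spatial statistics instead of statistics pooled over the batch, can be sketched in NumPy. This is a minimal sketch without the learned affine (scale/shift) parameters; `eps` is an assumed numerical stabilizer:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x: (N, C, H, W); statistics shared across the batch and spatial dims
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # x: (N, C, H, W); each sample normalizes its own spatial statistics,
    # so one image's contrast cannot leak into another's stylization
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(2, 3, 8, 8))
y_in = instance_norm(x)
y_bn = batch_norm(x)
```

The only difference is the axes over which statistics are taken, which is also why instance normalization behaves identically at training and testing time.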
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9x, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG-16 found that the number of parameters can be reduced by 13x, from 138 million to 10.3 million, again with no loss of accuracy.
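The prune step of the three-step method hinges on a magnitude threshold. A hedged NumPy sketch follows; the threshold-selection rule (drop the smallest-magnitude fraction) is an illustrative choice, not necessarily the paper's exact criterion:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights and return the
    pruned weights plus a mask used to freeze zeros during retraining."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k] if k > 0 else 0.0
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.array([[0.01, -0.8],
              [0.5, -0.02]])
pruned, mask = prune_by_magnitude(w, 0.5)  # drop the two smallest weights
```

During the retrain step, gradients for masked-out positions would be multiplied by `mask` so the pruned connections stay at zero while the surviving weights are fine-tuned.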
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
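The factoring into two estimators meets at an aggregation layer; the mean-subtracted form below is one of the aggregation variants described in the paper, sketched here in NumPy:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine a scalar state value V(s) and per-action advantages A(s, a)
    into Q(s, a); subtracting the mean advantage keeps V and A identifiable
    (adding a constant to A and subtracting it from V would otherwise
    leave Q unchanged)."""
    return value + (advantages - advantages.mean(axis=-1, keepdims=True))

v = np.array([[1.0]])              # value head output, shape (batch, 1)
a = np.array([[2.0, 0.0, -2.0]])   # advantage head output, shape (batch, actions)
q = dueling_q(v, a)                # -> [[3.0, 1.0, -1.0]]
```

Because only this aggregation layer changes, any standard algorithm (e.g. Q-learning targets) can train the network unmodified, which is what the abstract means by "without imposing any change to the underlying reinforcement learning algorithm."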
Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks CNNs succeeded at. In this paper we construct CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground truth data sets are not sufficiently large to train a CNN, we generate a large synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.
Several variants of the long short-term memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful functional ANalysis Of VAriance (fANOVA) framework. In total, we summarize the results of 5400 experimental runs (approximately 15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
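For reference, one step of the vanilla LSTM baseline the study compares against can be sketched as below, highlighting the forget gate and output activation the analysis finds most critical. Dimensions and random weights are illustrative; a trained cell would also typically carry biases and, in some of the studied variants, peephole connections:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of the vanilla LSTM. W: (4d, n), U: (4d, d), b: (4d,).
    Gates are stacked as [input, forget, output, candidate]."""
    d = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:d])           # input gate
    f = sigmoid(z[d:2 * d])       # forget gate -- found most critical
    o = sigmoid(z[2 * d:3 * d])   # output gate
    g = np.tanh(z[3 * d:4 * d])   # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)    # tanh is the output activation function
    return h_new, c_new

rng = np.random.default_rng(0)
n, d = 3, 4
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=n), h, c,
                 rng.normal(size=(4 * d, n)), rng.normal(size=(4 * d, d)),
                 np.zeros(4 * d))
```

The eight studied variants correspond to removing or coupling individual pieces of this step (e.g. dropping `f`, removing the final `tanh`), which is what makes the component-wise fANOVA comparison possible.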
A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.
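The cross-view aggregation can be sketched as element-wise pooling of per-view feature vectors into one compact descriptor. The max-pooling choice below is one common form of view-pooling, and the toy feature matrix stands in for real per-view CNN activations:

```python
import numpy as np

def view_pool(view_features):
    """Collapse per-view CNN features of shape (num_views, D) into a single
    shape descriptor of shape (D,) by element-wise max over views."""
    return np.max(view_features, axis=0)

# Two rendered views, three feature dimensions (illustrative values)
feats = np.array([[0.2, 0.9, 0.1],
                  [0.7, 0.1, 0.3]])
descriptor = view_pool(feats)  # -> [0.7, 0.9, 0.3]
```

Because the pooling is permutation-invariant, the descriptor does not depend on the order in which views are rendered, and the same pipeline applies whether the inputs are renderings or hand-drawn sketches.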
The field of machine learning has taken a dramatic twist in recent times, with the rise of the Artificial Neural Network (ANN). These biologically inspired computational models are able to far exceed the performance of previous forms of artificial intelligence in common machine learning tasks. One of the most impressive forms of ANN architecture is the Convolutional Neural Network (CNN). CNNs are primarily used to solve difficult image-driven pattern recognition tasks, and with their precise yet simple architecture they offer a simplified way of getting started with ANNs. This document provides a brief introduction to CNNs, discussing recently published papers and newly formed techniques in developing these remarkably effective image recognition models. This introduction assumes you are familiar with the fundamentals of ANNs and machine learning.
Naledi Lenah Adam, Grzegorz Kowalik, Andrew Tyler, et al.
Background: Simultaneous multi-slice (SMS) bSSFP imaging enables stress myocardial perfusion imaging with high spatial resolution and increased spatial coverage. Standard parallel imaging techniques (e.g., TGRAPPA) can be used for image reconstruction but result in a high noise level. Alternatively, iterative reconstruction techniques based on temporal regularization (ITER) improve image quality but are associated with reduced temporal signal fidelity and long computation times, limiting their online use. The aim is to develop an image reconstruction technique for SMS-bSSFP myocardial perfusion imaging combining parallel imaging and image-based denoising using a novel noise map estimation network (NoiseMapNet), which preserves both sharpness and temporal signal profiles and has low computational cost.
Methods: The proposed reconstruction of SMS images consists of a standard temporal parallel imaging reconstruction (TGRAPPA) with motion correction (MOCO), followed by image denoising using NoiseMapNet. NoiseMapNet is a deep learning network based on a 2D Unet architecture that predicts a noise map from an input noisy image; this map is then subtracted from the noisy image to generate the denoised image. The approach was evaluated in 17 patients who underwent stress perfusion imaging using an SMS-bSSFP sequence. Images were reconstructed with (a) TGRAPPA with MOCO (hereafter referred to as TGRAPPA), (b) iterative reconstruction with integrated motion compensation (ITER), and (c) the proposed NoiseMapNet-based reconstruction. Normalized mean squared error (NMSE) with respect to TGRAPPA, myocardial sharpness, image quality, perceived SNR (pSNR), and the number of diagnostic segments were evaluated.
Results: NMSE of NoiseMapNet was lower than that of ITER for both the myocardium (0.045 ± 0.021 vs. 0.172 ± 0.041, p < 0.001) and the left ventricular blood pool (0.025 ± 0.014 vs. 0.069 ± 0.020, p < 0.001). There were no significant differences between the methods for myocardial sharpness (p = 0.77) or number of diagnostic segments (p = 0.36). ITER led to higher image quality than NoiseMapNet/TGRAPPA (2.7 ± 0.4 vs. 1.8 ± 0.4/1.3 ± 0.6, p < 0.001) and higher pSNR than NoiseMapNet/TGRAPPA (3.0 ± 0.0 vs. 2.0 ± 0.0/1.3 ± 0.6, p < 0.001). Importantly, NoiseMapNet yielded higher pSNR (p < 0.001) and image quality (p < 0.008) than TGRAPPA. Computation time of NoiseMapNet was only 20 s for one entire dataset.
Conclusion: NoiseMapNet-based reconstruction enables fast SMS image reconstruction for stress myocardial perfusion imaging while preserving sharpness and temporal signal profiles.
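The residual structure of the denoiser, predicting a noise map and subtracting it rather than predicting the clean image directly, can be sketched as follows. The toy high-pass estimator stands in for the trained 2D Unet and is purely illustrative of the data flow:

```python
import numpy as np

def denoise(noisy, noise_map_net):
    """Residual denoising: the network predicts a noise map that is
    subtracted from its input, rather than regressing the clean image."""
    return noisy - noise_map_net(noisy)

def toy_noise_estimator(img):
    """Hypothetical stand-in for the trained network: a crude high-pass
    estimate of noise (pixel minus 4-neighbor local mean)."""
    local_mean = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
                  np.roll(img, 1, 1) + np.roll(img, -1, 1)) / 4.0
    return img - local_mean

noisy = np.random.default_rng(0).normal(size=(8, 8))
clean_est = denoise(noisy, toy_noise_estimator)
```

One appeal of this formulation is that a network predicting all-zero noise leaves the image untouched, so the denoiser can only remove what it explicitly identifies as noise, which helps preserve sharpness and temporal signal profiles.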
Diseases of the circulatory (Cardiovascular) system
Iranian traditional residential architecture is renowned for its central-courtyard houses, which are admired for their grandeur. While the courtyards and nearby spaces receive considerable artistic and historical appreciation, those situated further away often receive less attention. These areas are typically considered auxiliary and less functional for living, thereby receiving limited attention in architectural discussions. This study examines 26 traditional central-courtyard houses to investigate how spaces located farther from the courtyard (‘second-order’) compare to those directly adjacent (‘first-order’). It challenges the assumption that distance from the courtyard correlates with reduced functionality. Surprisingly, the analysis identifies similar architectural characteristics in both second-order and first-order spaces, suggesting that distant areas may serve functional roles comparable to those nearer the courtyard.
Radosław Rutkowski, Miłosz Raczyński, Remigiusz Iwańkowicz, et al.
The article explores the potential of Digital Twin (DT) technology in the design and dynamic assessment of the energy performance of multi-family buildings. Traditional approaches to building energy assessment provide static data that do not account for changing operational conditions and lack continuous energy consumption-monitoring capabilities. The use of a Digital Twin enables monitoring and analysis of the building’s energy parameters at every stage of its life cycle. The article presents the application of DT technology for assessing energy performance, which must meet legal requirements, at the conceptual stage and in the early phases of design. Validation conducted on four multi-family buildings demonstrated high accuracy, with the average difference between predicted and actual energy performance (EP) values below 3.5%. Thanks to the DT model, energy parameters can be determined already at the conceptual stage, which helps avoid costly changes in later project phases. Early determination of these parameters also allows for accurate estimation of design and investment costs. Tests of the proposed solution were conducted on several multi-family buildings, comparing preliminary data with final results. The research results show that DT technology allows for precise planning of energy performance at the conceptual and preliminary design stages. This reduces operational costs, increases energy efficiency, and better adapts buildings to changing technological and legal conditions.
Shear Thickening Fluid (STF) is a specialized high-concentration particle suspension capable of rapidly and reversibly altering its viscosity when exposed to sudden impacts. Consequently, STF-based dampers deliver a self-adaptive damping force and demonstrate significant potential for applications in structural vibration control. This study presents both a modeling and experimental investigation of a novel double-rod structured STF damper. Initially, a compound STF is formulated using silica particles as the dispersed phase and polyethylene glycol solution as the dispersing medium. The rheological properties of the STF are then experimentally evaluated, and its constitutive rheological behavior is described using the G-R model. Following this, the flow behavior of the STF within the damper’s annular gap is explored, leading to the development of a two-dimensional axisymmetric fluid simulation model for the damper. Based on this model, the dynamic mechanism of the proposed STF damper is analyzed. Subsequently, the STF damper is optimally designed and subjected to experimental investigation using a dynamic testing platform under different working conditions. The experimental results reveal that the proposed STF damper, whose equivalent stiffness can achieve a nearly threefold change with excitation frequency and amplitude, exhibits good self-adaptive capabilities. By dividing the damper force into two parts, the frictional damping pressure drop and the osmotic pressure drop generated by the “jamming effect”, a fitting model is proposed that aligns closely with the nonlinear performance of the STF damper.
In 5G and beyond networks, low-latency digital signatures are essential to ensure the security, integrity, and non-repudiation of massive data in communication processes. The binary finite field-based elliptic curve digital signature algorithm (ECDSA) is particularly suitable for achieving low-latency digital signatures due to its carry-free characteristics. This paper proposes a low-latency and universal architecture for point multiplication (PM) and double point multiplication (DPM) based on the differential addition chain (DAC), designed for signing and verification in ECDSA. By employing the DAC, the area-time product of DPM can be decreased and throughput efficiency increased. Moreover, the execution pattern of the proposed architecture is uniform, resisting simple power analysis and high-order power analysis. Based on the data dependency, two Karatsuba–Ofman multipliers and four non-pipelined squarers are utilized in the architecture to achieve a compact timing schedule without idle cycles for the multipliers during the computation process. Consequently, the calculation latency of DPM is minimized to five clock cycles in each loop. The proposed architecture is implemented on a Xilinx Virtex-7, performing DPM in 3.584, 5.656, and 7.453 µs with 8135, 13372, and 17898 slices over GF(2^163), GF(2^233), and GF(2^283), respectively. Among existing designs that are resistant to high-order analysis, our architecture demonstrates throughput efficiency improvements of 36.7% over GF(2^233) and 9.8% over GF(2^283), respectively.
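The uniform execution pattern of a differential addition chain can be illustrated with a Montgomery-ladder-style scalar multiplication: every key bit triggers exactly one group addition and one doubling, whatever its value, which is the property that frustrates simple power analysis. In this sketch plain integer arithmetic stands in for elliptic-curve point operations over GF(2^m):

```python
def ladder_multiply(k, P, add, double):
    """Montgomery ladder computing k*P with one add and one double per bit,
    regardless of the bit's value. Maintains the invariant R1 = R0 + P."""
    R0, R1 = 0, P  # 0 stands in for the group identity element
    for bit in bin(k)[2:]:  # scan bits most-significant first
        if bit == '1':
            R0 = add(R0, R1)
            R1 = double(R1)
        else:
            R1 = add(R0, R1)
            R0 = double(R0)
    return R0

# Integer addition as a stand-in group; real ECDSA hardware would use
# point addition/doubling built from GF(2^m) multipliers and squarers.
result = ladder_multiply(163, 1, lambda a, b: a + b, lambda a: 2 * a)  # -> 163
```

The compact timing schedule in the paper follows from this regularity: because each loop iteration issues the same fixed sequence of field operations, the two multipliers and four squarers can be kept busy without idle cycles.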