Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images
In-Hwan Jin, Haesoo Choo, Seong-Hun Jeong
et al.
To achieve realistic immersion in landscape images, fluids such as water and clouds need to move within the image while revealing new scenes from various camera perspectives. Recently, a field called dynamic scene video has emerged, combining single-image animation with 3D photography. These methods use a pseudo-3D space implicitly represented with Layered Depth Images (LDIs), which separate a single image into depth-based layers so that elements like water and clouds can move within the image while new scenes are revealed from different camera perspectives. However, because landscapes typically consist of continuous elements, including fluids, separating a landscape image into discrete layers can diminish depth perception and introduce distortions depending on camera movement. Furthermore, because the 3D space is modeled only implicitly, the output may be limited to videos in the 2D domain, reducing its versatility. In this paper, we propose representing a complete 3D space for dynamic scene video by modeling explicit representations, specifically 4D Gaussians, from a single image. The framework optimizes 3D Gaussians by generating multi-view images from a single image and then creates 3D motion to optimize 4D Gaussians. The most important part of the proposed framework is consistent 3D motion estimation, which estimates motion common to the multi-view images so that the motion in 3D space closely matches actual motion. To the best of our knowledge, this is the first attempt to consider animation while representing a complete 3D space from a single landscape image. Our model demonstrates realistic immersion in various landscape images through diverse experiments and metrics. Extensive experimental results are available at https://cvsp-lab.github.io/ICLR2025_3D-MOM/.
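The abstract describes consistent 3D motion estimation only at a high level, so the following is a hedged sketch under assumptions, not the authors' implementation: per-view 2D motion fields (`flows`), depth maps, intrinsics `Ks`, and camera-to-world poses are hypothetical inputs; each view's 2D motion is lifted into 3D and the per-view fields (assumed resampled to a shared grid) are averaged into one common motion field.

```python
import numpy as np

def lift_to_world(flow_2d, depth, K, cam_to_world):
    """Back-project a per-pixel 2D flow into 3D world-space displacements.

    Assumes displaced pixels keep their source depth, a rough approximation
    for the mostly in-plane motion of fluids such as water and clouds.
    """
    h, w, _ = flow_2d.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    K_inv = np.linalg.inv(K)
    p0 = (pix @ K_inv.T) * depth[..., None]             # 3D point per pixel
    moved = pix + np.concatenate([flow_2d, np.zeros((h, w, 1))], axis=-1)
    p1 = (moved @ K_inv.T) * depth[..., None]           # displaced 3D point
    R = cam_to_world[:3, :3]
    return (p1 - p0) @ R.T                              # rotate into world frame

def consistent_3d_motion(flows, depths, Ks, poses):
    """Average per-view 3D motion fields into one shared motion estimate."""
    fields = [lift_to_world(f, d, K, T)
              for f, d, K, T in zip(flows, depths, Ks, poses)]
    return np.mean(fields, axis=0)
```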
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
Jianyu Wu, Hao Yang, Xinhua Zeng
et al.
The diagnosis of pathological images is often limited by expert availability and regional disparities, highlighting the importance of automated diagnosis using Vision-Language Models (VLMs). Traditional multimodal models typically emphasize outcomes over the reasoning process, compromising the reliability of clinical decisions. To address the weak reasoning abilities and lack of supervised processes in pathological VLMs, we propose PathVLM-R1, a visual language model designed specifically for pathological images. We base our model on Qwen2.5-VL-7B-Instruct and enhance its performance for pathological tasks through carefully designed post-training strategies. First, we conduct supervised fine-tuning guided by pathological data to imbue the model with foundational pathological knowledge, forming a new pathological base model. Subsequently, we introduce Group Relative Policy Optimization (GRPO) and propose dual reward-driven reinforcement learning optimization, which enforces strict supervision of the reasoning process and of result accuracy via a cross-modal process reward and an outcome accuracy reward. On pathological image question-answering tasks, PathVLM-R1 achieves a 14% improvement in accuracy over baseline methods and outperforms the Qwen2.5-VL-32B version despite having a significantly smaller parameter count. Furthermore, in out-of-domain evaluation on four medical imaging modalities (Computed Tomography (CT), dermoscopy, fundus photography, and Optical Coherence Tomography (OCT)), PathVLM-R1's transfer performance improved by an average of 17.3% compared to traditional SFT methods. These results indicate that PathVLM-R1 not only enhances accuracy but also possesses broad applicability and potential for extension.
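As a concrete illustration of the dual-reward GRPO step, here is a minimal sketch of the group-relative advantage computation. The combined reward and the names `process_scores`, `outcomes`, and the 50/50 weighting are hypothetical stand-ins for the paper's cross-modal process reward and outcome accuracy reward; only the group normalization is standard GRPO.

```python
import numpy as np

def grpo_advantages(process_scores, outcomes, w_process=0.5, w_outcome=0.5):
    """Group-relative advantages for a group of responses to one question.

    process_scores: per-response reasoning-quality scores in [0, 1].
    outcomes: per-response answer correctness (0 or 1).
    """
    rewards = (w_process * np.asarray(process_scores)
               + w_outcome * np.asarray(outcomes, dtype=float))
    # GRPO normalizes rewards within the sampled group instead of a critic.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 sampled answers to one pathology question.
adv = grpo_advantages([0.9, 0.4, 0.7, 0.2], [1, 0, 1, 0])
```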
When Do Domain-Specific Foundation Models Justify Their Cost? A Systematic Evaluation Across Retinal Imaging Tasks
David Isztl, Tahm Spitznagel, Gabor Mark Somfai
et al.
Large vision foundation models have been widely adopted for retinal disease classification without systematic evidence justifying their parameter requirements. In the present work, we address two critical questions: First, are large domain-specific foundation models essential, or do compact general-purpose architectures suffice? Second, does specialized retinal pretraining justify its computational cost? To answer these, we benchmark initialization strategies across four retinal imaging classification tasks spanning Optical Coherence Tomography (OCT) and Color Fundus Photography (CFP) modalities: 8-class OCT classification, 3-class diabetic macular edema (DME), 5-class diabetic retinopathy (DR), and 3-class glaucoma (GL) detection. We evaluate 12-13 model configurations per task, including vision transformers (22.8M-86.6M parameters), Swin Transformers (27.6M-28.3M), ConvNeXt (28.6M), and the domain-specific RETFound models (303M), under identical training conditions. Our results challenge prevailing assumptions. First, we demonstrate that pretraining provides universal benefits (5.18-18.41% improvement), scaling with task difficulty. Second, compact architectures (27-29M) dominate the Pareto frontiers; SwinV2-tiny achieves top-1 performance on three datasets. Third, RETFound (303M) justifies its computational cost only for challenging DR grading (71.15% accuracy), while ImageNet pretraining proves sufficient for all other tasks (DME accuracy: 99.24%, OCT accuracy: 97.96%). CFP tasks show larger pretraining accuracy gains (9.13-18.41%) than OCT (5.18%). The evidence thus suggests that compact general-purpose models deliver near-optimal performance for most retinal classification tasks, and that specialized foundation models are warranted only for fine-grained discrimination under extreme class imbalance.
APTOS-2024 challenge report: Generation of synthetic 3D OCT images from fundus photographs
Bowen Liu, Weiyi Zhang, Peranut Chotcomwongse
et al.
Optical Coherence Tomography (OCT) provides high-resolution, 3D, and non-invasive visualization of retinal layers in vivo, serving as a critical tool for lesion localization and disease diagnosis. However, its widespread adoption is limited by equipment costs and the need for specialized operators. In comparison, 2D color fundus photography offers faster acquisition and greater accessibility with less dependence on expensive devices. Although generative artificial intelligence has demonstrated promising results in medical image synthesis, translating 2D fundus images into 3D OCT images presents unique challenges due to inherent differences in data dimensionality and biological information between modalities. To advance generative models in the fundus-to-3D-OCT setting, the Asia Pacific Tele-Ophthalmology Society (APTOS-2024) organized a challenge titled Artificial Intelligence-based OCT Generation from Fundus Images. This paper details the challenge framework (referred to as the APTOS-2024 Challenge), including the benchmark dataset, the evaluation methodology featuring two fidelity metrics (image-based distance, measuring pixel-level OCT B-scan similarity, and video-based distance, measuring semantic-level volumetric consistency), and an analysis of top-performing solutions. The challenge attracted 342 participating teams, with 42 preliminary submissions and 9 finalists. Leading methodologies incorporated innovations in hybrid data preprocessing or augmentation (cross-modality collaborative paradigms), pre-training on external ophthalmic imaging datasets, integration of vision foundation models, and model architecture improvement. The APTOS-2024 Challenge is the first benchmark demonstrating the feasibility of fundus-to-3D-OCT synthesis as a potential solution for improving ophthalmic care accessibility in under-resourced healthcare settings, while helping to expedite medical research and clinical applications.
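The challenge report defines both fidelity metrics precisely; the sketch below only illustrates the two-level idea as I read it. The mean per-B-scan pixel error and the embedding distance, along with the hypothetical `embed_volume` encoder, are assumptions, not the official scoring code.

```python
import numpy as np

def image_based_distance(pred_volume, gt_volume):
    """Mean pixel-level distance over corresponding OCT B-scans (slices)."""
    assert pred_volume.shape == gt_volume.shape  # (num_bscans, H, W)
    per_bscan = np.mean((pred_volume - gt_volume) ** 2, axis=(1, 2))
    return float(per_bscan.mean())

def video_based_distance(pred_volume, gt_volume, embed_volume):
    """Semantic distance between volumes treated as B-scan sequences.

    embed_volume: hypothetical pretrained video encoder mapping a volume
    to a fixed-length feature vector.
    """
    e_pred, e_gt = embed_volume(pred_volume), embed_volume(gt_volume)
    return float(np.linalg.norm(e_pred - e_gt))
```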
A Preliminary, Photography-Based Assessment of Bee Diversity at the Finca Botánica Organic Farm in the Central Pacific Coast of Ecuador
Joseph S. Wilson, Tyler M. Wilson, Chris Packer
et al.
Understanding wild bee diversity is critical for pollinator conservation, particularly in understudied tropical regions like coastal Ecuador. This preliminary study provides a photography-based assessment of bee diversity at Finca Botánica, an organic and regenerative farm on Ecuador’s central Pacific coast. Over a 10-day survey in December 2024, researchers documented 51 bee species across four families, with Apidae being the most represented. The study highlights a predominance of solitary, ground-nesting bees and a lower-than-expected diversity of Meliponini (stingless bees) and Euglossini (orchid bees) compared to other regions of Ecuador. Many species were found in forest patches, ecological corridors, and cover-cropped maize fields, underscoring the role of sustainable farming practices in supporting pollinator diversity. While photographic methods provided valuable preliminary data, they also revealed limitations in species-level identification, reinforcing the need for future specimen-based surveys. These findings suggest that Ecuador’s dry coastal forests may harbor a richer bee community than previously recognized and that organic farms can serve as important refuges for native pollinators.
Patient-derived VIDA and dermoscopy versus sequential planimetry for assessing activity in vitiligo patients receiving narrowband ultraviolet B: a comparative longitudinal study
Ali F. El Eriny, Eman H. Elmorsy, Carmen I. Farid
et al.
Background
Judging the stability/activity of vitiligo is of therapeutic and prognostic significance. Clinical and biochemical activity indicators are available, but none is currently universally accepted.
Objective
To compare the accuracy of the patient-derived vitiligo disease activity score (VIDA) and dermoscopy versus sequential planimetry in determining vitiligo stability/activity.
Patients and methods
A total of 120 otherwise healthy patients with nonsegmental vitiligo who reported disease stability for the preceding 6 months were included. Global assessment and photography were followed by selection of a stable, non-repigmenting lesion for longitudinal evaluation of planimetric changes and for dermoscopy. Patients received the standard narrowband ultraviolet B protocol for 3 months, and patient recall of global and lesional progression was recorded. The accuracy of patient-derived VIDA and of dermoscopic signs, with planimetric changes as the reference, was compared for judging vitiligo stability/activity.
Results
Patient-derived VIDA showed 58.33% accuracy in determining disease stability, while dermoscopy showed 97.50% overall accuracy. The absence of a pigment network and the absence of signs of activity (collective absence of an ill-defined border, satellite lesions, and microkoebnerization) showed 100% accuracy in establishing disease stability. Conversely, the presence of any microkoebnerization, starburst sign, or satellite/tapioca-sago sign was 100% indicative of activity. Border definition gave 91.67% overall accuracy for judging stability, while perifollicular pigmentation showed an accuracy of 49.17%.
Conclusion
Patient-derived VIDA is generally unreliable, whereas longitudinal follow-up of planimetric changes and dermoscopic assessment are more reliable. The absence of a pigment network, together with the collective absence of an ill-defined border, microkoebnerization, and satellite lesions, is 100% accurate in predicting stable vitiligo.
bit2bit: 1-bit quanta video reconstruction via self-supervised photon prediction
Yehe Liu, Alexander Krull, Hector Basevi
et al.
Quanta image sensors, such as SPAD arrays, are an emerging sensor technology that produces 1-bit arrays representing photon detection events over exposures as short as a few nanoseconds. In practice, raw data are post-processed using heavy spatiotemporal binning to create more useful and interpretable images, at the cost of degraded spatiotemporal resolution. In this work, we propose bit2bit, a new method for reconstructing high-quality image stacks at the original spatiotemporal resolution from sparse binary quanta image data. Inspired by recent work on Poisson denoising, we developed an algorithm that creates a dense image sequence from sparse binary photon data by predicting the photon arrival location probability distribution. However, due to the binary nature of the data, we show that the assumption of a Poisson distribution is inadequate; instead, we model the process as a Bernoulli lattice process derived from the truncated Poisson. This leads to a novel self-supervised solution based on a masked loss function. We evaluate our method using both simulated and real data. On simulated data from a conventional video, we achieve a mean PSNR of 34.35 with extremely photon-sparse binary input (<0.06 photons per pixel per frame). We also present a novel dataset of real SPAD high-speed videos captured under a wide range of challenging imaging conditions (strong and weak ambient light, strong motion, ultra-fast events, etc.), which will be made available to the community and on which we demonstrate the promise of our approach. Both reconstruction quality and throughput substantially surpass state-of-the-art methods such as Quanta Burst Photography (QBP). Our approach significantly enhances the visualization and usability of the data, enabling the application of existing analysis techniques.
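A minimal sketch, under assumptions, of the masked self-supervised objective described above: random pixels of the 1-bit input are hidden, the network predicts a photon-arrival logit at every pixel, and a Bernoulli (binary cross-entropy) loss is applied only at the masked locations. The mask fraction and the assumption that `model` maps a (B, 1, T, H, W) float tensor of 0/1 detections to same-shaped logits are illustrative, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def masked_bernoulli_loss(model, binary_frames, mask_frac=0.1):
    """binary_frames: float tensor of 0/1 photon detections, (B, 1, T, H, W)."""
    mask = (torch.rand_like(binary_frames) < mask_frac).float()
    masked_input = binary_frames * (1.0 - mask)      # hide masked photons
    logits = model(masked_input)                     # per-pixel photon logit
    loss = F.binary_cross_entropy_with_logits(
        logits, binary_frames, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp_min(1.0)
```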
CDI: Copyrighted Data Identification in Diffusion Models
Jan Dubiński, Antoni Kowalczuk, Franziska Boenisch
et al.
Diffusion Models (DMs) benefit from large and diverse datasets for their training. Since this data is often scraped from the Internet without permission from the data owners, this raises concerns about copyright and intellectual property protections. While (illicit) use of data is easily detected for training samples perfectly re-created by a DM at inference time, it is much harder for data owners to verify if their data was used for training when the outputs from the suspect DM are not close replicas. Conceptually, membership inference attacks (MIAs), which detect if a given data point was used during training, present themselves as a suitable tool to address this challenge. However, we demonstrate that existing MIAs are not strong enough to reliably determine the membership of individual images in large, state-of-the-art DMs. To overcome this limitation, we propose CDI, a framework for data owners to identify whether their dataset was used to train a given DM. CDI relies on dataset inference techniques: instead of using the membership signal from a single data point, CDI leverages the fact that most data owners, such as providers of stock photography, visual media companies, or even individual artists, own datasets with multiple publicly exposed data points, all of which might be included in the training of a given DM. By selectively aggregating signals from existing MIAs, using new handcrafted methods to extract features for these datasets, feeding them to a scoring model, and applying rigorous statistical testing, CDI allows data owners with as few as 70 data points to identify, with a confidence of more than 99%, whether their data was used to train a given DM. CDI thereby represents a valuable tool for data owners to claim illegitimate use of their copyrighted data. We make the code available at https://github.com/sprintml/copyrighted_data_identification.
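A conceptual sketch of the dataset-inference step, not the authors' exact pipeline: per-image MIA feature vectors are scored by a trained `scorer` (hypothetical), and a one-sided test asks whether the suspect set scores significantly higher than known non-members. The Welch t-test is one plausible choice of statistical test under these assumptions.

```python
import numpy as np
from scipy import stats

def dataset_inference(scorer, suspect_feats, nonmember_feats, alpha=0.01):
    """scorer maps per-image MIA feature vectors to membership scores."""
    s_suspect = scorer(suspect_feats)      # e.g., >= 70 owned data points
    s_control = scorer(nonmember_feats)    # known non-member images
    # One-sided Welch t-test: do suspect scores exceed control scores?
    t, p = stats.ttest_ind(s_suspect, s_control,
                           equal_var=False, alternative="greater")
    return p < alpha, p                    # claim at >99% confidence if True
```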
AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior
Guoqiang Liang, Qingnan Fan, Bingtao Fu
et al.
Blind face restoration (BFR) is a fundamental and challenging problem in computer vision. To faithfully restore high-quality (HQ) photos from poor-quality ones, recent research predominantly relies on facial image priors from powerful pretrained text-to-image (T2I) diffusion models. However, such priors often lead to the incorrect generation of non-facial features and insufficient facial details, rendering them less practical for real-world applications. In this paper, we propose a novel framework, AuthFace, that achieves highly authentic face restoration results by exploring a face-oriented generative diffusion prior. To learn such a prior, we first collect a dataset of 1.5K high-quality images, with resolutions exceeding 8K, captured by professional photographers. Based on this dataset, we then introduce a novel face-oriented restoration-tuning pipeline that fine-tunes a pretrained T2I model. Guided by the key criteria of quality-first and photography-guided annotation, the images are retouched and reviewed under the guidance of photographers so that they exhibit rich facial features; this photography-guided annotation system fully exploits the potential of the high-quality photographic images. In this way, the potent natural image priors of pretrained T2I diffusion models can be subtly harnessed, specifically enhancing their capability in facial detail restoration. Moreover, to minimize artifacts in critical facial areas such as the eyes and mouth, we propose a time-aware latent facial feature loss to learn the authentic face restoration process. Extensive experiments on synthetic and real-world BFR datasets demonstrate the superiority of our approach.
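The abstract does not define the time-aware latent facial feature loss, so the following is a hedged sketch under assumptions: a latent feature distance restricted to facial-region masks (e.g., eyes and mouth) and weighted by the diffusion timestep so low-noise steps are penalized more. The linear weighting, the mask, and all names are hypothetical.

```python
import torch

def time_aware_facial_loss(feat_pred, feat_ref, face_mask, t, T=1000):
    """feat_*: (B, C, H, W) latent features; face_mask: (B, 1, H, W) in {0, 1};
    t: (B,) diffusion timesteps (0 = clean, T = pure noise)."""
    w = (1.0 - t.float() / T).view(-1, 1, 1, 1)   # emphasize low-noise steps
    diff = (feat_pred - feat_ref) ** 2 * face_mask  # restrict to facial areas
    return (w * diff).mean()
```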
ECG-Image-Database: A Dataset of ECG Images with Real-World Imaging and Scanning Artifacts; A Foundation for Computerized ECG Image Digitization and Analysis
Matthew A. Reyna, Deepanshi, James Weigle
et al.
We introduce the ECG-Image-Database, a large and diverse collection of electrocardiogram (ECG) images generated from ECG time-series data, with real-world scanning, imaging, and physical artifacts. We used ECG-Image-Kit, an open-source Python toolkit, to generate realistic images of 12-lead ECG printouts from raw ECG time-series. The images include realistic distortions such as noise, wrinkles, stains, and perspective shifts, generated both digitally and physically. The toolkit was applied to 977 12-lead ECG records from the PTB-XL database and 1,000 from Emory Healthcare to create high-fidelity synthetic ECG images. These unique images were subjected to both programmatic distortions using ECG-Image-Kit and physical effects like soaking, staining, and mold growth, followed by scanning and photography under various lighting conditions to create real-world artifacts. The resulting dataset includes 35,595 software-labeled ECG images with a wide range of imaging artifacts and distortions. The dataset provides ground truth time-series data alongside the images, offering a reference for developing machine and deep learning models for ECG digitization and classification. The images vary in quality, from clear scans of clean papers to noisy photographs of degraded papers, enabling the development of more generalizable digitization algorithms. ECG-Image-Database addresses a critical need for digitizing paper-based and non-digital ECGs for computerized analysis, providing a foundation for developing robust machine and deep learning models capable of converting ECG images into time-series. The dataset aims to serve as a reference for ECG digitization and computerized annotation efforts. ECG-Image-Database was used in the PhysioNet Challenge 2024 on ECG image digitization and classification.
DISCOVER: 2-D Multiview Summarization of Optical Coherence Tomography Angiography for Automatic Diabetic Retinopathy Diagnosis
Mostafa El Habib Daho, Yihao Li, Rachid Zeghlache
et al.
Diabetic Retinopathy (DR), an ocular complication of diabetes, is a leading cause of blindness worldwide. Traditionally, DR is monitored using Color Fundus Photography (CFP), a widespread 2-D imaging modality. However, DR classifications based on CFP have poor predictive power, resulting in suboptimal DR management. Optical Coherence Tomography Angiography (OCTA) is a recent 3-D imaging modality offering enhanced structural and functional information (blood flow) with a wider field of view. This paper investigates automatic DR severity assessment using 3-D OCTA. A straightforward solution to this task is a 3-D neural network classifier. However, 3-D architectures have numerous parameters and typically require many training samples. A lighter solution consists of using 2-D neural network classifiers that process 2-D en-face (or frontal) projections and/or 2-D cross-sectional slices. Such an approach mimics the way ophthalmologists analyze OCTA acquisitions: 1) en-face flow maps are often used to detect avascular zones and neovascularization, and 2) cross-sectional slices are commonly analyzed to detect macular edema, for instance. However, arbitrary data reduction or selection might result in information loss. Two complementary strategies are thus proposed to optimally summarize OCTA volumes with 2-D images: 1) a parametric en-face projection optimized through deep learning and 2) a cross-sectional slice selection process controlled through gradient-based attribution. The full summarization and DR classification pipeline is trained end to end. The automatic 2-D summary can be displayed in a viewer or printed in a report to support the decision. We show that the proposed 2-D summarization and classification pipeline outperforms direct 3-D classification, with the advantage of improved interpretability.
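A minimal sketch of what a learnable ("parametric") en-face projection could look like: instead of a fixed mean or maximum over depth, per-depth weights are learned end to end with the downstream DR classifier. The softmax weighting and the module names are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ParametricEnFace(nn.Module):
    """Learned weighted average over the depth axis of an OCTA volume."""
    def __init__(self, num_depths):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_depths))

    def forward(self, volume):                 # volume: (B, D, H, W)
        w = torch.softmax(self.logits, dim=0)  # learned depth weighting
        return torch.einsum("d,bdhw->bhw", w, volume)  # 2-D en-face summary

# Hypothetical usage: proj = ParametricEnFace(num_depths=304)
# en_face = proj(octa_volume)  # fed to a 2-D classifier, trained jointly
```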
Video WeAther RecoGnition (VARG): An Intensity-Labeled Video Weather Recognition Dataset
Himanshu Gupta, Oleksandr Kotlyar, Henrik Andreasson
et al.
Adverse weather (rain, snow, and fog) can negatively impact computer vision tasks by introducing noise into sensor data; recognizing weather conditions is therefore essential for building safe and robust autonomous systems in the agricultural, autonomous driving, and drone sectors. The performance degradation in computer vision tasks due to adverse weather depends on the type of weather and its intensity, which governs the amount of noise in the sensor data. However, existing weather recognition datasets often lack intensity labels, limiting their usefulness. To address this limitation, we present VARG, a novel video-based weather recognition dataset with weather intensity labels. The dataset comprises a diverse set of short video sequences collected from various social media platforms and videos recorded by the authors, processed into usable clips and categorized into three major weather categories (rain, fog, and snow), each with three intensity classes: absent/no, moderate, and high. The dataset contains 6742 annotated clips from 1079 videos, with 5159 clips in the training set and 1583 clips in the test set. Two sets of annotations are provided for training: the first to train models as a multi-label weather intensity classifier and the second to train models as a multi-class classifier over three weather scenarios, as illustrated below. This paper describes the dataset characteristics and presents an evaluation study using several deep learning-based video recognition approaches for weather intensity prediction.
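A small illustration, with made-up values, of how the two annotation views described above could relate for one clip: a multi-label intensity target per weather type versus a single multi-class weather label. The exact label encoding in VARG may differ.

```python
INTENSITY = {"absent": 0, "moderate": 1, "high": 2}

clip_annotation = {"rain": "moderate", "fog": "absent", "snow": "absent"}

# View 1: multi-label target, one intensity class per weather type.
multilabel_target = {w: INTENSITY[v] for w, v in clip_annotation.items()}

# View 2: single multi-class label, the most intense weather type present.
multiclass_label = max(clip_annotation, key=lambda w: INTENSITY[clip_annotation[w]])
```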
Multi‐exposure embeddings for graph learning: Towards high dynamic range image saliency prediction
Jun Xing, Qiudan Zhang, Xuelin Shen
et al.
Identifying saliency in high dynamic range (HDR) images is a fundamentally important issue in HDR imaging and plays a critical role in comprehensive scene understanding. Most existing studies leverage hand-crafted features for HDR image saliency prediction, lacking the capability to fully exploit the characteristics of HDR images (i.e. wider luminance range and richer colour gamut). Here, systematic studies are carried out on HDR image saliency prediction by proposing a new framework that singles out the contributions of multi-exposure images. Specifically, inspired by the mechanism of HDR imaging, the method first utilizes graph neural networks to model the relations among the multi-exposure images and the tone-mapped image obtained from an HDR image, enabling more discriminative saliency-related feature representations. Subsequently, saliency features driven by global semantic knowledge are aggregated from the tone-mapped image by enhancing global context-aware semantic information. Finally, a fusion module is designed to integrate the saliency-oriented feature representations originating from the multi-exposure images and the tone-mapped image, producing the saliency maps of HDR images. Moreover, a new challenging HDR eye-fixation database (HDR-EYEFix) is created, which is expected to further contribute to research on HDR image saliency prediction. Experimental results show that the method obtains superior performance compared to state-of-the-art methods.
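A conceptual sketch, under assumptions, of the graph-based relation modeling: features of each multi-exposure image (and of the tone-mapped image) form nodes of a fully connected graph, and one round of message passing relates the exposures before saliency fusion. The layer design and names are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ExposureGraphLayer(nn.Module):
    """One message-passing round over exposure-level feature nodes."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.Linear(2 * dim, dim)

    def forward(self, nodes):                  # nodes: (N_exposures, dim)
        msgs = self.msg(nodes)
        # Mean of messages from all *other* nodes (fully connected graph).
        agg = (msgs.sum(0, keepdim=True) - msgs) / max(nodes.shape[0] - 1, 1)
        return torch.relu(self.upd(torch.cat([nodes, agg], dim=-1)))
```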
Optimising spectroscopic observations of transiting exoplanets
Linn Boldt-Christmas, Fabio Lesjak, Ansgar Wehrhahn
et al.
When observing the atmospheres of transiting exoplanets with high-resolution spectroscopy, one aims to detect well-resolved spectral features with the high signal-to-noise ratios (SNR) that modern spectrographs make possible. However, obtaining such high-quality observations comes with a trade-off: a lower cadence of fewer, longer exposures across the transit collects more photons thanks to reduced overheads, enhancing the SNR of each observation, while a higher cadence of more, shorter exposures minimises the smearing of spectral features caused by the continuously changing radial velocity of the planet. Since maximising SNR and minimising smearing are both beneficial to the analysis, there is a need to establish where the optimal compromise lies. In this work, we model real transit events based on targets as they would be observed with VLT/CRIRES+ at Paranal Observatory. Creating four hypothetical scenarios, we simulate each observation across 100 realisations of the same transit event, varying only the time resolution. We remove telluric and stellar lines using the SYSREM algorithm and analyse the residuals through cross-correlation with model templates, measuring how successfully each time resolution and case detects the planetary signal. We demonstrate that the detection significance changes continuously with time resolution and that, as a function of it, the significance has a clear maximum. The strength and location of this maximum vary with, for example, planetary system parameters, instrumentation, and the number of removal iterations. We therefore discuss why observers should take several factors into account, using a strategy akin to the 'exposure triangle' of traditional photography, in which a balance must be struck by considering the full context of the observation. Our method is robust and may be employed by observers to estimate the best observational strategies for other targets.
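A toy numeric sketch of the trade-off described above: within a fixed transit duration, the collected photon count scales with the duty cycle t/(t + overhead), so per-transit SNR grows with exposure time, while the planet's radial-velocity change during each exposure smears the lines. All numbers (RV slope, line width, overhead, the quadrature broadening model) are illustrative assumptions, not the paper's simulation.

```python
import numpy as np

def rv_smearing_kms(t_exp_s, rv_slope_kms_per_hr=30.0):
    """Planet RV change across one exposure near mid-transit."""
    return rv_slope_kms_per_hr * t_exp_s / 3600.0

def relative_significance(t_exp_s, line_width_kms=3.0, overhead_s=60.0):
    snr = np.sqrt(t_exp_s / (t_exp_s + overhead_s))   # duty-cycle SNR factor
    smear = rv_smearing_kms(t_exp_s)
    widening = np.sqrt(1.0 + (smear / line_width_kms) ** 2)  # smeared line
    return snr / widening

# Scanning exposure times reveals an interior maximum, as the paper describes.
for t in [30, 60, 120, 300, 600, 1200]:
    print(t, round(relative_significance(t), 3))
```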
Defocus to focus: Photo-realistic bokeh rendering by fusing defocus and radiance priors
Xianrui Luo, Juewen Peng, Ke Xian
et al.
We consider the problem of realistic bokeh rendering from a single all-in-focus image. Bokeh rendering mimics aesthetic shallow depth-of-field (DoF) in professional photography, but these visual effects generated by existing methods suffer from simple flat background blur and blurred in-focus regions, giving rise to unrealistic rendered results. In this work, we argue that realistic bokeh rendering should (i) model depth relations and distinguish in-focus regions, (ii) sustain sharp in-focus regions, and (iii) render physically accurate Circle of Confusion (CoC). To this end, we present a Defocus to Focus (D2F) framework to learn realistic bokeh rendering by fusing defocus priors with the all-in-focus image and by implementing radiance priors in layered fusion. Since no depth map is provided, we introduce defocus hallucination to integrate depth by learning to focus. The predicted defocus map implies the blur amount of bokeh and is used to guide weighted layered rendering. In layered rendering, we fuse images blurred by different kernels based on the defocus map. To increase the reality of the bokeh, we adopt radiance virtualization to simulate scene radiance. The scene radiance used in weighted layered rendering reassigns weights in the soft disk kernel to produce the CoC. To ensure the sharpness of in-focus regions, we propose to fuse upsampled bokeh images and original images. We predict the initial fusion mask from our defocus map and refine the mask with a deep network. We evaluate our model on a large-scale bokeh dataset. Extensive experiments show that our approach is capable of rendering visually pleasing bokeh effects in complex scenes. In particular, our solution receives the runner-up award in the AIM 2020 Rendering Realistic Bokeh Challenge.
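A simplified sketch of the defocus-guided weighted layered rendering idea: the image is blurred at several levels, and each pixel softly blends the levels nearest its predicted defocus. Gaussian blurs stand in for the paper's soft disk kernels, and the radii and soft-assignment weighting are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def layered_bokeh(image, defocus, radii=(0.0, 2.0, 4.0, 8.0)):
    """image: (H, W, 3) float array; defocus: (H, W) predicted blur radius."""
    layers = [image if r == 0 else gaussian_filter(image, sigma=(r, r, 0))
              for r in radii]
    out = np.zeros_like(image)
    weights = np.zeros(image.shape[:2])
    for r, layer in zip(radii, layers):
        w = np.exp(-((defocus - r) ** 2) / 2.0)   # soft assignment to level r
        out += layer * w[..., None]
        weights += w
    return out / weights[..., None]
```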
Perfused and Nonperfused Microaneurysms Identified and Characterized by Structural and Angiographic OCT
Min Gao, Tristan T. Hormel, Yukun Guo
et al.
Purpose
Microaneurysms (MAs) have distinct, oval-shaped, hyperreflective walls on structural OCT and inconsistent flow signal in the lumen on OCT angiography (OCTA). Their relationship to regional macular edema in diabetic retinopathy (DR) has not been quantitatively explored.
Participants
A total of 99 participants, including 23 with mild nonproliferative diabetic retinopathy (NPDR), 25 with moderate NPDR, 34 with severe NPDR, and 17 with proliferative DR.
Methods
We obtained 3 x 3-mm scans with a commercial device (Solix, Visionix/Optovue) in 99 patients with DR. Trained graders manually identified MAs and their location relative to the anatomic layers from cross-sectional OCT. Microaneurysms were first classified as perfused if flow signal was present in the OCTA channel. Perfused MAs were then further classified as fully or partially perfused based on the flow characteristics in en face OCTA. The presence of retinal fluid near MAs on OCT was compared between perfused and nonperfused types. We also compared OCT-based MA detection to fundus photography (FP)- and fluorescein angiography (FA)-based detection.
Results
We identified 308 MAs (166 fully perfused, 88 partially perfused, 54 nonperfused) in 42 eyes using OCT and OCTA. Nearly half of the MAs identified in this study straddle the inner nuclear layer and the outer plexiform layer. Compared with partially perfused and nonperfused MAs, fully perfused MAs were more likely to be associated with local retinal fluid, and the associated fluid volumes were larger. OCT/OCTA detected all MAs found on FP. Although not all MAs seen with FA were identified with OCT, some MAs seen with OCT were not visible with FA or FP.
Conclusions
OCT-identified MAs with colocalized flow on OCTA are more likely to be associated with diabetic macular edema (DME) than those without flow.
A 3DCNN-Based Knowledge Distillation Framework for Human Activity Recognition
Hayat Ullah, Arslan Munir
Human action recognition has been actively explored over the past two decades to further advancements in the video analytics domain. Numerous studies have investigated the complex sequential patterns of human actions in video streams. In this paper, we propose a knowledge distillation framework that distills spatio-temporal knowledge from a large teacher model to a lightweight student model using an offline knowledge distillation technique. The proposed framework takes two models: a large pre-trained 3DCNN (three-dimensional convolutional neural network) teacher model and a lightweight 3DCNN student model, where the teacher is pre-trained on the same dataset on which the student is to be trained. During offline knowledge distillation training, the distillation algorithm trains only the student model, enabling it to reach the prediction accuracy of the teacher model. To evaluate the performance of the proposed method, we conduct extensive experiments on four benchmark human action datasets. The quantitative results verify the efficiency and robustness of the proposed method over state-of-the-art human action recognition methods, with up to 35% improvement in accuracy over existing methods. Furthermore, we evaluate the inference time of the proposed method and compare it with that of the state-of-the-art methods. Experimental results reveal an improvement of up to 50× in frames per second (FPS) over the state-of-the-art methods. The short inference time and high accuracy make our framework suitable for human activity recognition in real-time applications.
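A standard offline knowledge-distillation objective of the kind described: the frozen 3DCNN teacher's softened logits supervise the student alongside the ground-truth labels. The temperature and mixing weight are illustrative; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KL term plus hard cross-entropy term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean") * (T * T)            # soft-target term
    hard = F.cross_entropy(student_logits, labels)  # ground-truth term
    return alpha * soft + (1.0 - alpha) * hard

# Offline setting: teacher.eval(); with torch.no_grad(): t_logits = teacher(clip)
```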
Detecting, Tracking and Counting Motorcycle Rider Traffic Violations on Unconstrained Roads
Aman Goyal, Dev Agarwal, Anbumani Subramanian
et al.
In many Asian countries with unconstrained road traffic conditions, driving violations such as not wearing helmets and triple-riding are a significant source of fatalities involving motorcycles. Identifying and penalizing such riders is vital in curbing road accidents and improving citizens' safety. With this motivation, we propose an approach for detecting, tracking, and counting motorcycle riding violations in videos taken from a vehicle-mounted dashboard camera. We employ a curriculum learning-based object detector to better tackle challenging scenarios such as occlusions. We introduce a novel trapezium-shaped object boundary representation to increase robustness and tackle the rider-motorcycle association. We also introduce an amodal regressor that generates bounding boxes for the occluded riders. Experimental results on a large-scale unconstrained driving dataset demonstrate the superiority of our approach compared to existing approaches and other ablative variants.
ADAM Challenge: Detecting Age-related Macular Degeneration from Fundus Images
Huihui Fang, Fei Li, Huazhu Fu
et al.
Age-related macular degeneration (AMD) is the leading cause of visual impairment among the elderly worldwide. Early detection of AMD is of great importance, as the vision loss caused by this disease is irreversible and permanent. Color fundus photography is the most cost-effective imaging modality for screening for retinal disorders. Cutting-edge deep learning-based algorithms have recently been developed for automatically detecting AMD from fundus images. However, a comprehensive annotated dataset and standard evaluation benchmarks are still lacking. To address this issue, we set up the Automatic Detection challenge on Age-related Macular degeneration (ADAM), held as a satellite event of the ISBI 2020 conference. The ADAM challenge consisted of four tasks covering the main aspects of detecting and characterizing AMD from fundus images: detection of AMD, detection and segmentation of the optic disc, localization of the fovea, and detection and segmentation of lesions. As part of the challenge, we have released a comprehensive dataset of 1200 fundus images with AMD diagnostic labels, pixel-wise segmentation masks for both the optic disc and AMD-related lesions (drusen, exudates, hemorrhages, and scars, among others), as well as the coordinates of the macular fovea. A uniform evaluation framework has been built to enable a fair comparison of different models on this dataset. During the challenge, 610 results were submitted for online evaluation, and 11 teams ultimately participated in the onsite challenge. This paper introduces the challenge, the dataset, and the evaluation methods, summarizes the participating methods, and analyzes their results for each task. In particular, we observed that ensembling strategies and the incorporation of clinical domain knowledge were key to improving the performance of the deep learning models.
Open‐set iris recognition based on deep learning
Jie Sun, Shipeng Zhao, Sheng Miao
et al.
The existing iris recognition methods offer excellent recognition performance for known classes, but they do not consider the rejection of unknown classes. Rejecting unknown object classes is important for a reliable iris recognition system. This study proposes open-set iris recognition based on deep learning. In the method, training the deep network clusters the extracted iris features near the feature centre of each class of iris image. The authors then build an open-class features outlier network (OCFON) containing distance features, which maps the features extracted by the deep network to a new feature space and classifies them. Finally, unknown-class samples are identified by a SoftMax probability threshold. The authors conducted experiments on an open iris dataset constructed from the CASIA-Iris-Twins and CASIA-Iris-Lamp iris datasets. The experiments show that the proposed method has good open-set iris recognition performance, can effectively distinguish iris samples of unknown classes, and has little impact on the recognition of known classes.
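A minimal sketch of the final decision step described above: a probe is rejected as an unknown class when its maximum SoftMax probability falls below a threshold; otherwise it is assigned to the best-matching known class. The threshold value is illustrative.

```python
import torch
import torch.nn.functional as F

def open_set_decide(logits, threshold=0.9):
    """logits: (num_known_classes,) output of the recognition network."""
    probs = F.softmax(logits, dim=0)
    conf, cls = probs.max(dim=0)
    if conf < threshold:
        return "unknown", float(conf)   # reject: unknown iris class
    return int(cls), float(conf)        # accept: best-matching known class
```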