Large-scale pre-training methods that learn cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute-force manner, in this paper we propose a new learning method, Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected and are often mentioned in the paired text. We pre-train an Oscar model on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, setting a new state of the art on six well-established vision-language understanding and generation tasks.
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost model performance. Without bells and whistles, GIT establishes a new state of the art on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Code is released at https://github.com/microsoft/GenerativeImage2Text.
With the success of Vision Transformers (ViTs) in computer vision tasks, recent work tries to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches have been proposed to accelerate the attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, and this holds even against the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constrained hardware. In this work, we investigate a central question: can transformer models run as fast as MobileNet and maintain a similar size? We revisit the design choices of ViTs and propose a novel supernet with low latency and high parameter efficiency. We further introduce a novel fine-grained joint search strategy for transformer models that can find efficient architectures by optimizing latency and number of parameters simultaneously. The proposed models, EfficientFormerV2, achieve 3.5% higher top-1 accuracy than MobileNetV2 on ImageNet-1K with similar latency and parameters. This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
Why does the Hilton Hotel look different from a coaching inn - because of changes in architectural taste or changes in the hotel business? The study of building types is a history of institutions - official, social and commercial; of their changing architectural requirements; and of the way in which these requirements have been met. The building types covered here are national monuments, government buildings, theatres, libraries, museums, hospitals, prisons, hotels, exchanges and banks, warehouses and offices, railway stations, market halls and exhibition buildings, shops and department stores, and factories. Now available again, this winner of the Wolfson Literary Award in 1976 is an expanded version of the A.W. Mellon Lectures in the Fine Arts delivered at Washington in 1970. It was Pevsner's major preoccupation after the monumental Buildings of England series came to an end.
While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLM as a plug-and-play module without extra training. Extensive experimental studies demonstrate the effectiveness of HALC in reducing OH, outperforming state-of-the-art methods across four benchmarks.
We introduce an effective and efficient method that grounds (i.e., localizes) natural sentences in long, untrimmed video sequences. Specifically, a novel Temporal GroundNet (TGN) is proposed to temporally capture the evolving fine-grained frame-by-word interactions between video and sentence. TGN sequentially scores a set of temporal candidates ending at each frame based on the exploited frame-by-word interactions, and finally grounds the segment corresponding to the sentence. Unlike traditional methods that treat overlapping segments separately in a sliding-window fashion, TGN aggregates the historical information and generates the final grounding result in a single pass. We extensively evaluate the proposed TGN on three public datasets and obtain significant improvements over state-of-the-art methods. We further show the consistent effectiveness and efficiency of TGN through an ablation study and a runtime test.
We introduce EXAMS-V, a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies, e.g., religion, fine arts, business, etc. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 languages from 7 language families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering school exam questions from various countries, with a variety of education systems. This distinctive approach calls for intricate reasoning across diverse languages and relies on region-specific knowledge. Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image. Our evaluation results demonstrate that this is a challenging dataset, which is difficult even for advanced vision-text models such as GPT-4V and Gemini; this underscores the inherent complexity of the dataset and its significance as a future benchmark.
This artistic research project explores how Norwegian folk music can be integrated into my established musical sound. Through listening, reflection, and creative practice, I have allowed the aesthetics and expression of folk tradition to influence my own musical language. The project resulted in the album “Relasjonsorkester”, where I explore the intersections between tradition and contemporary music through three selected songs: “Hei, hallo”, “Tankerom”, and “Langsiktig sparing”. The process demonstrates that respectful listening, acquisition of theoretical knowledge, and conscious artistic choices are essential for grounding new expressions within a living tradition. The thesis reflects on how tradition can become a part of one’s artistic identity without one necessarily becoming a tradition bearer in the conventional sense.
We demonstrate that the update of weight matrices in learning algorithms can be described in the framework of Dyson Brownian motion, thereby inheriting many features of random matrix theory. We relate the level of stochasticity to the ratio of the learning rate and the mini-batch size, providing more robust evidence for a previously conjectured scaling relationship. We discuss universal and non-universal features in the resulting Coulomb gas distribution and identify the Wigner surmise and Wigner semicircle explicitly in a teacher-student model and in the (near-)solvable case of the Gaussian restricted Boltzmann machine.
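As a reminder of what the Wigner surmise mentioned above looks like in its simplest setting, the following is a minimal sketch that samples random 2x2 real symmetric (GOE) matrices and compares the distribution of their normalised eigenvalue gaps with the surmise, whose cumulative form is 1 - exp(-pi s^2 / 4). This is a textbook illustration, not the teacher-student or restricted Boltzmann machine experiments of the abstract. The function names, the sample size, and the seed are all assumptions made for illustration.

```python
import math
import random


def goe_2x2_spacing(rng):
    """Eigenvalue gap of a random 2x2 real symmetric matrix [[a, b], [b, c]]."""
    a = rng.gauss(0.0, 1.0)             # diagonal entries: variance 1
    c = rng.gauss(0.0, 1.0)
    b = rng.gauss(0.0, math.sqrt(0.5))  # off-diagonal entry: variance 1/2
    # The two eigenvalues are (a + c)/2 +/- sqrt(((a - c)/2)^2 + b^2),
    # so their gap is sqrt((a - c)^2 + 4 b^2).
    return math.sqrt((a - c) ** 2 + 4.0 * b ** 2)


def wigner_surmise_cdf(s):
    """CDF of the GOE Wigner surmise p(s) = (pi/2) s exp(-pi s^2 / 4)."""
    return 1.0 - math.exp(-math.pi * s * s / 4.0)


def empirical_cdf(samples, s):
    """Fraction of samples that are <= s."""
    return sum(1 for x in samples if x <= s) / len(samples)


rng = random.Random(0)          # fixed seed (illustrative choice)
mean_gap = math.sqrt(math.pi)   # exact mean gap for the variances used above
spacings = [goe_2x2_spacing(rng) / mean_gap for _ in range(20000)]

# Print empirical vs. predicted cumulative probabilities at a few gap values.
for s in (0.5, 1.0, 2.0):
    print(s, round(empirical_cdf(spacings, s), 3), round(wigner_surmise_cdf(s), 3))
```

For 2x2 matrices the surmise is exact, so the two printed columns should agree up to sampling noise; for larger matrices it is only a close approximation to the true spacing distribution.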
Hamida Beg, Muhammad Ferdowsieh, Fatemeh Hosseini
Although the importance of human health is clear, self-harm has become widespread, and some consider it permissible on the basis of the rule of ownership and a person's right over their own self. The central question this article raises regarding self-harm is the religious ruling on its various levels, as derived by Imami jurists from rational and transmitted evidence.
This study, written using a descriptive-analytical method, reached the following results: self-harm has degrees that include suicide, amputation, loss of one of the senses, exposure to illness, minor harm, and humiliation. Killing is forbidden by Qur'anic verses, hadiths, and reason, unless it is in self-defense or jihad and with the permission of the guardian. The Sharia likewise forbids amputation, because it exposes a person to destruction, unless there is a logical and medical reason for it. As for the other levels, the jurists disagree about them.
History of scholarship and learning. The humanities, Arts in general
This article discusses the peculiar use of the term creativity by neoliberalism, the effect this has had on educational policies in recent years, and how that effect can be counteracted through a critical educational model. We characterize the concept of creativity through a multidisciplinary approach with an analytical foundation that diverges from the notions coined in business discourse over recent decades. Because of this disparity, we question how pedagogy ought to use creativity to best develop students’ critical capacity, enabling them to critique the strategies and objectives of their learning process, to willingly engage their subjectivity, and to foster their critical thinking. As for methodology, all phases of this documentary research project relied on a bibliographical review of the sources that, given their specific subject matter and academic impact, are most relevant to its ideas and purposes.
<p>Advances in photogrammetric surveys and immersive visualization through Virtual Reality (VR) have brought about a revolution in all areas of architectural representation. Architectural heritage has not escaped this revolution, since these techniques make it possible to recover heritage elements that are currently missing or decontextualized and that could not otherwise be contemplated. This article develops a paradigmatic case in this field: the recovery of the Gothic choir of the Cathedral of Girona, completely dismantled and partially preserved in different locations across Spain. Based on previous work by the authors on the recontextualization of similar environments (Sánchez Riera et al., 2022a), this paper describes the construction of a virtual, hyperrealistic immersive model of the missing ensemble and its original context. The geometric and lighting integration of both models is shown in a self-developed application, and different scenes built with it are presented as examples. The proposal allows interactive visualization of the complex by switching between its current and original states. The possibility of consulting additional information on certain singular elements, now lost or decontextualized, in the very location for which they were originally conceived aims to promote their study and dissemination. It will also make it possible to establish hypotheses about their original design, use and configuration, ultimately promoting the conservation of architectural heritage that would otherwise remain forgotten.</p><p>DOI: https://doi.org/10.20365/disegnarecon.30.2023.8</p>
The preferences and semiotic roots of digital nomads are considered, and the current state of globalization processes is characterized. Glocalization is viewed as a new architectural paradigm that increases the dynamics of mobility and universality in the typological patterns of the architecture of the urban (global) environment. Objects used by digital nomads, such as “third places”, “virtual fires”, and Wi-Fi infrastructure as elements of eco-digital cities, will need tools similar to controlled chaos.
The traditional consideration of architectural history as a linear process, where one style successively replaces another, still dominates, but does not explain many phenomena. V. S. Goryunov substantiated a fundamentally different, non-linear model of the historical process in architecture on the basis of the architecture of the 19th and early 20th centuries. V. S. Goryunov and M. P. Tubli proposed a stylistic polyphony of architecture with the introduction of the concepts of “Art Nouveau period” and “anti-eclectic movement”. In October 2021, a collection of articles by V. S. Goryunov entitled “Theoretical Problems of the History of Architecture” was published. The articles in this collection represent the main theoretical ideas of the scientist.
The article focuses on the representation of striptease in Italy in the late 1950s. While the unveiling of the female body became increasingly pervasive (in posters, magazines, theaters, and cinema), Italian moralists sought strategies to contain the phenomenon. At the same time, many courts issued acquittals that, in effect, liberalized striptease. At the center of the dispute was not female nudity (which had been shown for at least a decade) but rather the rules of behavior for women, contested between the will to control and the desire to look. In both cases, women were treated as an object of male will.
This paper describes an extension to the DMN standard, called cDMN. It aims to enlarge the expressivity of DMN in order to solve more complex problems, while retaining DMN's goal of being readable by domain experts. We test cDMN by solving the most complex challenges posted on the DM Community website. We compare our own cDMN solutions to the solutions that have been submitted to the website and find that our approach is competitive, both in readability and compactness. Moreover, cDMN is able to solve more challenges than any other approach.
This article analyzes, from a glottopolitical perspective, a series of four monolingual lexicographic works on Argentine Spanish published between 1890 and 1903, which are grouped under the category “dictionaries of barbarisms”. These are linguistic instruments that claim a prescriptive character, since they include and, by extension, exclude certain usages and particular words, in addition to describing, classifying and evaluating the lexical divergences between American usage (in this case Argentine) and Peninsular usage at the height of the mass migration movement into the country. Broadly speaking, these linguistic instruments record barbarisms, neologisms and foreign borrowings and censure their use, taking the norm of Madrid Castilian as their standard. It is argued here that these devices reveal dominant value systems and conceptions of language that go far beyond the strictly linguistic sphere.
The city of Padua was surveyed by technicians coordinated by Giovanni Valle, who, under the direction of Simone Stratico (whose role has not yet been clarified [1]), elaborated, drew and ‘amended’ a map that was engraved in Rome by 1784 by Giovanni Volpato. For technical reasons the huge surface of the drawing was transferred onto 20 copper plates of various sizes [Ghironi, Mazzi 1985], becoming, in its original edition, one of the most important and famous maps of the city. Produced according to precise criteria of geometric measurement, it can therefore be considered an accurate record of the urban condition of Padua in the 18th century [Zaggia 2009]. For this reason, the map is of great historical importance, since it crystallizes the forma urbis of Padua before the extensive nineteenth- and twentieth-century transformations.