Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization
Pengcheng He, Baolin Peng, Liyang Lu
et al.
This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state-of-the-art encoder-decoder model using three techniques. First, we use a two-phase pre-training process to improve the model's performance on low-resource summarization tasks. The model is first pre-trained using text corpora for language understanding, and is then continually pre-trained on summarization corpora for grounded text generation. Second, we replace self-attention layers in the encoder with disentangled attention layers, where each word is represented using two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method of encoding long sequences in a hierarchical manner. Z-Code++ sets a new state of the art on 9 out of 13 text summarization tasks across 5 languages. Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the fine-tuned 200x larger GPT3-175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms the competing models.
Moore's Law is dead, long live Moore's Law!
Nick Zhang
Moore's Law has been used by the semiconductor industry as a predictive indicator of its progress, and it has become a self-fulfilling prophecy. More people now tend to agree that the original Moore's Law has started to falter. This paper proposes a possible quantitative modification to Moore's Law that can also cover its derivative laws. It is intended to more accurately predict the roadmap of chips' performance and energy consumption.
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
Pengcheng He, Jianfeng Gao, Weizhu Chen
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure. Furthermore, we have pre-trained a multi-lingual model mDeBERTa and observed a larger improvement over strong baselines compared to English models. For example, the mDeBERTa Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at https://github.com/microsoft/DeBERTa.
Edsger W. Dijkstra: a Commemoration
Krzysztof R. Apt, Tony Hoare
This article is a multiauthored portrait of Edsger Wybe Dijkstra that consists of testimonials written by several friends, colleagues, and students of his. It provides unique insights into his personality, working style and habits, and his influence on other computer scientists, as a researcher, teacher, and mentor.
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao
et al.
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
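The disentangled attention score described in the abstract above, where content and relative-position vectors contribute separate terms, can be sketched numerically. The following is a minimal single-head illustration, not the released implementation: all names, sizes, and the clamping of relative distances are our simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, k = 4, 8, 3   # sequence length, hidden size, max relative distance

H = rng.normal(size=(seq, d))    # content vector for each token
P = rng.normal(size=(2 * k, d))  # shared relative-position embeddings

Wq_c, Wk_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # content projections
Wq_r, Wk_r = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # position projections

def rel_index(i, j, k):
    """Clamp the relative distance i - j into the bucket range [0, 2k)."""
    return int(np.clip(i - j + k, 0, 2 * k - 1))

Qc, Kc = H @ Wq_c, H @ Wk_c   # queries/keys from content
Qr, Kr = P @ Wq_r, P @ Wk_r   # queries/keys from relative positions

A = np.zeros((seq, seq))
for i in range(seq):
    for j in range(seq):
        r = rel_index(i, j, k)
        A[i, j] = (Qc[i] @ Kc[j]      # content-to-content term
                   + Qc[i] @ Kr[r]    # content-to-position term
                   + Qr[r] @ Kc[j])   # position-to-content term

A /= np.sqrt(3 * d)  # scale for the three summed terms
weights = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # row-wise softmax
```

The three additive terms are what "disentangled" refers to: content and position interact through separate projection matrices rather than through a single summed input embedding.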
Edsger Dijkstra. The Man Who Carried Computer Science on His Shoulders
Krzysztof R. Apt
This is a biographical essay about Edsger Wybe Dijkstra.
Life, The Mind, and Everything
Gary R. Prok
The incompleteness theorems of Gödel, Turing, and Chaitin, and Algorithmic Information Theory, have profound epistemological implications. Incompleteness limits our ability to ever understand every observable phenomenon in the universe. It limits the ability of evolutionary processes to find optimal solutions. It limits the detectability of machine consciousness. This is an effort to convey these thoughts and results in a somewhat entertaining manner.
Les connaissances de la toile (Knowledge on the Web)
Serge Abiteboul
How to manage knowledge on the Web.
Software Carpentry: Lessons Learned
Greg Wilson
Over the last 15 years, Software Carpentry has evolved from a week-long training course at the US national laboratories into a worldwide volunteer effort to raise standards in scientific computing. This article explains what we have learned along the way, the challenges we now face, and our plans for the future.
Removing Barriers to Interdisciplinary Research
Naomi Jacobs, Martyn Amos
A significant amount of high-impact contemporary scientific research occurs where biology, computer science, engineering and chemistry converge. Although programmes have been put in place to support such work, the complex dynamics of interdisciplinarity are still poorly understood. In this paper we interrogate the nature of interdisciplinary research and how we might measure its "success", identify potential barriers to its implementation, and suggest possible mechanisms for removing these impediments.
The non-anticipation of the asynchronous systems
Serban E. Vlad
Asynchronous systems are the models of the asynchronous circuits of digital electrical engineering, and non-anticipation is one of the most important properties in systems theory. Our present purpose is to introduce several concepts of non-anticipation for asynchronous systems.
Stop That Subversive Spreadsheet!
David Chadwick
This paper documents the formation of the European Spreadsheet Risks Interest Group (EuSpRIG, www.eusprig.org) and outlines some of the research undertaken and reported upon by interested parties in EuSpRIG publications.
Edsger Wybe Dijkstra (1930 -- 2002): A Portrait of a Genius
Krzysztof R. Apt
We discuss the scientific contributions of Edsger Wybe Dijkstra, his opinions and his legacy.
Ten Incredibly Dangerous Software Ideas
G. A. Maney
This is a rough draft synopsis of a book presently in preparation. This book provides a systematic critique of the software industry. This critique is accomplished using classical methods in practical design science.
One More Revolution to Make: Free Scientific Publishing
Krzysztof R. Apt
Computer scientists are in the position to create new, free high-quality journals. So what would it take?
Methods for scaling a large member base
Nathan Boeger
The technical challenges of scaling websites with large and growing member bases, such as social networking sites, are numerous. One of these challenges is how to evenly distribute the growing member base across all available resources. This paper explores various methods that address this issue. The techniques used in this paper can be generalized and applied to various other problems that need to distribute data evenly among a finite set of resources.
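One widely used scheme for the kind of even distribution the abstract describes is consistent hashing with virtual nodes. The sketch below is our illustration of that general technique under assumed names (`ConsistentHashRing`, `db1`..`db3`), not necessarily one of the methods the paper surveys:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map member IDs to servers; adding or removing a server
    relocates only roughly 1/N of the members."""

    def __init__(self, servers, vnodes=100):
        self.ring = []                       # (hash, server) pairs
        for s in servers:
            for v in range(vnodes):          # virtual nodes smooth the split
                self.ring.append((self._h(f"{s}#{v}"), s))
        self.ring.sort()
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, member_id):
        """First ring position clockwise from the member's hash."""
        i = bisect.bisect(self._keys, self._h(str(member_id))) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
counts = {}
for m in range(10000):
    s = ring.server_for(m)
    counts[s] = counts.get(s, 0) + 1
```

With 100 virtual nodes per server, 10,000 members land in three shares of roughly a third each, and the same member always maps to the same server.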
Multiple-Size Divide-and-Conquer Recurrences
Ming-Yang Kao
This short note reports a master theorem on tight asymptotic solutions to divide-and-conquer recurrences with more than one recursive term: for example, T(n) = 1/4 T(n/16) + 1/3 T(3n/5) + 4 T(n/100) + 10 T(n/300) + n^2.
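For recurrences of this shape, Akra-Bazzi-style reasoning determines the tight solution via the critical exponent p solving sum_i a_i * b_i^p = 1, compared against the driving term f(n). The short numerical sketch below works the abstract's example; it is our illustration of that standard analysis, not the note's own proof technique:

```python
# T(n) = 1/4 T(n/16) + 1/3 T(3n/5) + 4 T(n/100) + 10 T(n/300) + n^2.
# Each pair is (a_i, b_i) for a recursive term a_i * T(b_i * n).
terms = [(1/4, 1/16), (1/3, 3/5), (4, 1/100), (10, 1/300)]

def g(p):
    """g(p) = sum a_i b_i^p - 1; strictly decreasing since all b_i < 1."""
    return sum(a * b**p for a, b in terms) - 1.0

lo, hi = 0.0, 5.0          # g(0) > 0 and g(5) < 0 bracket the root
for _ in range(100):       # bisection to high precision
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
p = (lo + hi) / 2
```

Here p comes out below 1, so the additive f(n) = n^2 term dominates the recursive terms and T(n) = Theta(n^2).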
Real Time Models of the Asynchronous Circuits: The Delay Theory
Serban E. Vlad
This book chapter introduces the delay theory, whose purpose is the modeling of the asynchronous circuits of digital electrical engineering with ordinary and differential pseudo-Boolean equations.
Theory and practice
Donald E. Knuth
The author argues to Silicon Valley that the most important and powerful part of computer science is work that is simultaneously theoretical and practical. He particularly considers the intersection of the theory of algorithms and practical software development. He combines examples from the development of the TeX typesetting system with clever jokes, criticisms, and encouragements.
Some first thoughts on the stability of the asynchronous systems
Serban E. Vlad
The (non-initialized, non-deterministic) asynchronous systems (in the input-output sense) are multi-valued functions from m-dimensional signals to sets of n-dimensional signals, a concept inspired by the modeling of asynchronous circuits. Our purpose is to state the problem of their stability.