Vision-Language Models trained on massive collections of human-generated data often reproduce and amplify societal stereotypes. One critical form of stereotyping reproduced by these models is homogeneity bias: the tendency to represent certain groups as more homogeneous than others. We investigate how this bias responds to hyperparameter adjustments in GPT-4, specifically examining sampling temperature and top-p, which control the randomness of model outputs. By generating stories about individuals from different racial and gender groups and comparing their similarities using vector representations, we assess both the robustness of the bias and its relationship with hyperparameter values. We find that (1) homogeneity bias persists across most hyperparameter configurations, with Black Americans and women represented more homogeneously than White Americans and men; (2) the relationship between hyperparameters and group representations shows unexpected non-linear patterns, particularly at extreme values; and (3) hyperparameter adjustments affect racial and gender homogeneity bias differently: increasing temperature or decreasing top-p can reduce racial homogeneity bias, but the same changes do not have the same effect on gender homogeneity bias. Our findings suggest that while hyperparameter tuning may mitigate certain biases to some extent, it cannot serve as a universal solution for addressing homogeneity bias across different social group dimensions.
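The abstract does not specify the similarity measure. A common choice, assumed here, is the mean pairwise cosine similarity among the embedding vectors of the stories generated for one group. A higher value means the group's stories are more alike, and therefore more homogeneous. In the sketch, toy 3-dimensional vectors stand in for sentence embeddings, because the source does not name an embedding model.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of story embeddings.

    Assumed homogeneity score: higher means the stories about a group are more alike.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy "embeddings" (hypothetical): group_a's stories are near-duplicates,
# while group_b's stories point in unrelated directions.
group_a = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1]]
group_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mean_pairwise_similarity(group_a) > mean_pairwise_similarity(group_b))
```

To compare hyperparameter settings, the same score would be computed separately for each temperature or top-p configuration and each group.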
In soccer, game context can skew offensive statistics in ways that misrepresent how well a team has played. For instance, in England's 1-2 loss to France in the 2022 FIFA World Cup quarterfinal, England attempted considerably more shots (16 to France's 8) and more corners (5 to 2), which might suggest they played better despite the loss. However, these statistics were largely accumulated while France was ahead and more willing to concede the offensive initiative to England. To explore how game context influences offensive performance, we analyze minute-by-minute, event-sequenced match data from 15 seasons across five major European leagues. Using count-response Generalized Additive Models, we consider features such as score and red-card differentials, home/away status, pre-match win probabilities, and game minute. We also use interaction terms to test several intuitive hypotheses about how these features might jointly explain offensive production. The selected model is then applied to project offensive statistics onto a standardized "common denominator" scenario: a tied home game with an equal number of players on each side. Unlike raw game totals, which disregard game context, the adjusted numbers offer a more contextualized comparison and reduce the likelihood of misrepresenting the relative quality of play.
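The abstract does not give the projection formula. One simple way to implement a "common denominator" adjustment, assumed here, is to rescale each minute's observed count by a ratio: the model's expected rate under the standardized scenario divided by its expected rate under the actual context. The sketch replaces the fitted GAM smooths with a hypothetical log-linear predictor whose coefficients are made up. Under a log link, the intercept cancels in that ratio.

```python
import math
from dataclasses import dataclass

@dataclass
class MinuteContext:
    score_diff: int  # team goals minus opponent goals
    man_diff: int    # team players on the pitch minus opponent players (red cards)
    home: bool

# Hypothetical log-rate coefficients standing in for the fitted GAM terms
COEF = {"score_diff": -0.15, "man_diff": 0.20, "home": 0.10}

# The "common denominator": a tied home game with equal numbers of players
STANDARD = MinuteContext(score_diff=0, man_diff=0, home=True)

def log_rate_offset(ctx):
    """Context part of the linear predictor (the intercept is omitted)."""
    return (COEF["score_diff"] * ctx.score_diff
            + COEF["man_diff"] * ctx.man_diff
            + COEF["home"] * int(ctx.home))

def adjusted_total(counts, contexts):
    """Rescale per-minute counts (e.g. shots) to the standardized scenario."""
    standard = log_rate_offset(STANDARD)
    return sum(n * math.exp(standard - log_rate_offset(ctx))
               for n, ctx in zip(counts, contexts))

# One shot while level at home, one shot while trailing 0-1 at home
contexts = [MinuteContext(0, 0, True), MinuteContext(-1, 0, True)]
print(adjusted_total([1, 1], contexts))
```

With these made-up coefficients, a shot taken while trailing counts for less than a full shot. This mirrors the intuition behind the England example.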
The years 2020 through 2023 were unusually tumultuous for children in the United States, and children's welfare was prominent in political debate. Theories in moral psychology suggest that political parties would treat concerns for children using different moral frames, and that moral conflict might drive substantial polarization in discussions about children. However, such partisan frames may still differ very little if, outside explicitly political contexts, there is limited underlying disagreement about moral issues and everyday concerns in childhood. We evaluate claims of universality and division in moral language using tweets from 2019-2023 linked to U.S. voter records, focusing on expressed morality. Our results show that mentions of children by Republicans and Democrats are usually similar, differing no more than mentions by women and men, and tend to contain no large differences in accompanying moral words. To the extent that mentions of children did differ across parties, these differences were confined to topics polarized well before the pandemic, and were slightly heightened when co-mentioned with 'kids' or 'children'. These topics accounted for a small fraction of conversations about children. Overall, polarization of online discussion around childhood appears to reflect escalating polarization along the lines of existing partisan conflicts rather than new concerns about the welfare of children during and after the pandemic.
The success of collaborative instruction in helping students achieve higher grades in introductory science, technology, engineering, and mathematics (STEM) courses has led many educators and researchers to assume these methods also address inequities. However, little evidence exists to test this assumption. Structural inequities in our society have led to the chronic underrepresentation of Black, Hispanic, women, and first-generation students in STEM disciplines. Broadening the participation of underrepresented groups in biology, chemistry, and physics would reduce social inequalities while harnessing diversity's economic impact on innovation and workforce expansion. We leveraged data on content knowledge from 18,791 students in 305 introductory courses using collaborative instruction at 45 institutions. We modeled student outcomes across the intersections of gender, race, ethnicity, and first-generation college status within and across science disciplines. Using these models, we examine the educational debts society owes college science students prior to instruction and whether instruction mitigates, perpetuates, or exacerbates those debts. The size of these educational debts and the extent to which courses added to or repaid these debts varied across disciplines. Across all three disciplines, society owed Black and Hispanic women and first-generation Black men the largest educational debts. Collaborative instructional strategies were not sufficient to repay society's educational debts.
In this research, we explore the contributions of a highly productive minority of scientists to the national Polish research output over the past three decades (1992-2021). Almost all previous approaches to high research productivity lack a time component: cross-sectional studies have not been complemented by longitudinal ones, so the scientists comprising the classes of top performers have not been tracked over time. We examine three classes of top performers (the upper 1%, 5%, and 10%) and find a surprising temporal stability of productivity patterns. The 1/10 and 10/50 rules consistently apply across the three decades: the upper 1% of scientists, on average, account for 10% of the national output, and the upper 10% account for almost 50% of total output, with significant disciplinary variations. The Relative Presence Index (RPI) we constructed shows that men are overrepresented and women underrepresented in all top-performer classes. Top performers are studied longitudinally through their detailed publishing histories, with micro-data drawn from the raw Scopus dataset. Based on odds ratio estimates, econometric models identify the three most important predictors of membership in the top-performance classes: gender, academic age, and research collaboration. The downward trend in fixed effects over successive six-year periods indicates increasing competition in academia. The study covers the entire population of internationally visible Polish scientists (N=152,043) and their 587,558 articles.
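The 1/10 and 10/50 rules are shares of total output held by the most productive fraction of scientists. The sketch below shows that computation under three assumptions: each scientist has one publication count (the abstract does not say whether full or fractional counting is used), ties are broken arbitrarily by sort order, and the top-group size is rounded to the nearest integer. The population is invented and was constructed so that both rules hold exactly.

```python
def top_share(outputs, fraction):
    """Share of total output produced by the top `fraction` of scientists.

    `outputs` holds one (hypothetical) publication count per scientist.
    The top group has round(fraction * n) members, with a minimum of one.
    """
    if not outputs or not 0 < fraction <= 1:
        raise ValueError("need non-empty outputs and 0 < fraction <= 1")
    total = sum(outputs)
    if total <= 0:
        raise ValueError("total output must be positive")
    ranked = sorted(outputs, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return sum(ranked[:k]) / total

# Invented population of 1,000 scientists: top 1% hold 10%, top 10% hold 50%
outputs = [90] * 10 + [40] * 90 + [5] * 900
print(top_share(outputs, 0.01), top_share(outputs, 0.10))
```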
Anil B. Gavade, Neel Kanwal, Priyanka A. Gavade et al.
Prostate cancer (PCa) is a severe disease affecting men worldwide. Early identification and precise diagnosis of PCa are important for effective treatment. Multi-parametric magnetic resonance imaging (mpMRI) has emerged as an invaluable imaging modality for PCa diagnosis, offering a precise anatomical view of the prostate gland and its tissue structure. Deep learning (DL) models can enhance existing clinical systems and improve patient care by locating regions of interest for physicians. Recently, DL techniques have been employed to develop pipelines for segmenting and classifying different cancer types. These studies show that DL can increase diagnostic precision and give objective results with reduced variability. This work uses well-known DL models for the classification and segmentation of mpMRI images to detect PCa. Our implementation involves four pipelines: Semantic DeepSegNet with ResNet50, DeepSegNet with a recurrent neural network (RNN), U-Net with RNN, and U-Net with a long short-term memory (LSTM) network. Each segmentation model is paired with a different classifier, and performance is evaluated using several metrics. Our experiments show that the pipeline combining U-Net and the LSTM model outperforms all other combinations in both segmentation and classification.
Joshua R. Minot, Marc Maier, Bradford Demarest et al.
Over the past decade, the gender pay gap has remained steady, with women earning on average 84 cents for every dollar earned by men. Many studies explain this gap through demand-side bias in the labor market, as reflected in employers' job postings. However, few studies analyze potential bias on the worker supply side. Here, we analyze the language in millions of US workers' resumes to investigate how differences in workers' self-representation by gender compare to differences in earnings. Across US occupations, language differences between male and female resumes correspond to 11% of the variation in the gender pay gap. This suggests that women's resumes that are semantically similar to men's resumes may have greater wage parity. However, surprisingly, occupations with greater language differences between male and female resumes have lower gender pay gaps. A doubling of the language difference between female and male resumes results in an annual wage increase of $2,797 for the average female worker. This result holds with controls for gender bias in resume text, and we find that per-word bias poorly describes the variance in the wage gap. The results demonstrate that textual data and self-representation are valuable factors for improving worker representations and understanding employment inequities.
Gregory Holste, Douwe van der Wal, Hans Pinckaers et al.
Prostate cancer is one of the leading causes of cancer-related death in men worldwide. Like many cancers, diagnosis involves expert integration of heterogeneous patient information such as imaging, clinical risk factors, and more. For this reason, there have been many recent efforts toward deep multimodal fusion of image and non-image data for clinical decision tasks. Many of these studies propose methods to fuse learned features from each patient modality, providing significant downstream improvements with techniques like cross-modal attention gating, Kronecker product fusion, orthogonality regularization, and more. While these enhanced fusion operations can improve upon feature concatenation, they often have an extremely high learning capacity, which makes them prone to overfitting on small or low-dimensional datasets. Rather than designing a highly expressive fusion operation, we propose three simple methods for improved multimodal fusion with small datasets that aid optimization by generating auxiliary sources of supervision during training: extra supervision, clinical prediction, and dense fusion. We validate the proposed approaches on prostate cancer diagnosis from paired histopathology imaging and tabular clinical features. The proposed methods are straightforward to implement and can be applied to any classification task with paired image and non-image data.
Kanan Mahammadli, Abdullah Burkan Bereketoglu, Ayse Gul Kabakci
Breast cancer, which also occurs in men, is the most common cancer among women and accounts for more than 1 in 10 new cancer diagnoses each year. It is also the second most common cause of cancer death among women. Early detection and tailored treatment are therefore essential: early detection allows appropriate, patient-specific treatment schedules and can also reveal the type of cyst. This paper employs class-level data augmentation to address undersampled classes and raise their detection rate. The approach has two key components: class-level data augmentation combined with structure-preserving stain normalization of hematoxylin and eosin-stained images, and a transformer-based ViTNet architecture trained via transfer learning for multiclass classification of breast cancer images. Together, these components use advanced image processing and deep learning to categorize breast cancer images as either benign or one of four distinct malignant subtypes. By focusing on class-level augmentation and catering to the unique characteristics of each class, the method increases classification precision on undersampled classes, which may help lower the mortality rates associated with breast cancer. The paper aims to ease the workload of medical specialists by automating the categorization of images into benign or one of four malignant types of breast cancer.
Honghan Wu, Minhong Wang, Aneeta Sylolypavan et al.
AI technologies are increasingly being tested and applied in critical environments, including healthcare. Without an effective way to detect and mitigate AI-induced inequalities, AI might do more harm than good, potentially widening underlying inequalities. This paper proposes a generic allocation-deterioration framework for detecting and quantifying AI-induced inequality. Specifically, AI-induced inequalities are quantified as the area between two allocation-deterioration curves. To assess the framework's performance, experiments were conducted on ten synthetic datasets (N>33,000) generated from HiRID, a real-world Intensive Care Unit (ICU) dataset. These experiments showed that the framework accurately detects and quantifies inequality in proportion to the controlled inequalities. Extensive analyses were carried out to quantify health inequalities (a) embedded in two real-world ICU datasets and (b) induced by AI models trained for two resource allocation scenarios. Results showed that, compared to men, women had up to 33% poorer deterioration in markers of prognosis when admitted to HiRID ICUs. All four AI models assessed were shown to induce significant inequalities (2.45% to 43.2%) for non-White compared to White patients. The models significantly exacerbated data-embedded inequalities in 3 out of 8 assessments, one of which was >9 times worse. The codebase is at https://github.com/knowlab/DAindex-Framework.
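The abstract defines the inequality measure as the area between two allocation-deterioration curves but does not give the curves' axes or the integration scheme. The sketch assumes both curves are sampled on a shared grid of allocation levels, for example one curve per patient group, and computes the area with the trapezoidal rule. The sampled values are invented.

```python
def area_between_curves(xs, curve_a, curve_b, signed=True):
    """Trapezoidal area between two curves sampled on the same x grid.

    signed=True integrates (curve_a - curve_b), so regions where the curves
    swap order cancel out. signed=False integrates the absolute gap
    (approximately, since a crossing inside an interval is not located).
    """
    if not (len(xs) == len(curve_a) == len(curve_b)) or len(xs) < 2:
        raise ValueError("need at least two aligned samples")
    area = 0.0
    for i in range(1, len(xs)):
        gap0 = curve_a[i - 1] - curve_b[i - 1]
        gap1 = curve_a[i] - curve_b[i]
        if not signed:
            gap0, gap1 = abs(gap0), abs(gap1)
        area += (xs[i] - xs[i - 1]) * (gap0 + gap1) / 2
    return area

# Hypothetical curves sampled at allocation levels 0..1
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
group_1 = [0.0, 0.30, 0.50, 0.65, 0.80]
group_2 = [0.0, 0.20, 0.40, 0.55, 0.70]
print(round(area_between_curves(xs, group_1, group_2), 4))
```

Whether the framework uses a signed or an unsigned area is not stated in the abstract, so both options are exposed.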
Multi-state models are increasingly used to model complex epidemiological and clinical outcomes over time. It is common to assume that these models are Markov, but the assumption is often unrealistic. The Markov assumption is seldom checked, and violations can lead to biased estimation of many parameters of interest. As argued by Datta and Satten (2001), the Aalen-Johansen estimator of occupation probabilities remains consistent in the non-Markov case. Putter and Spitoni (2018) exploit this fact to construct a consistent estimator of state transition probabilities, the landmark Aalen-Johansen estimator, which does not rely on the Markov assumption. A disadvantage of landmarking is data reduction, which leads to a loss of power. This is problematic for less-traveled transitions, and undesirable when such transitions do exhibit Markov behaviour. Using a framework of partially non-Markov multi-state models, we propose a hybrid landmark Aalen-Johansen estimator of transition probabilities. The proposed estimator is a compromise between regular Aalen-Johansen and landmark estimation, using transition-specific landmarking, and can drastically improve statistical power. The methods are compared in a simulation study and in a real-data application modelling individual transitions between states of sick leave, disability, education, work and unemployment. In the application, a birth cohort of 184,951 Norwegian men is followed for 14 years from the year they turn 21, using data from national registries.
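For reference, the regular Aalen-Johansen estimator is a product over event times of $I + \Delta\hat{A}(t)$. Each off-diagonal entry $\Delta\hat{A}_{hj}(t)$ is the number of observed $h \to j$ transitions at $t$ divided by the number at risk in state $h$ just before $t$. Each diagonal entry is $\Delta\hat{A}_{hh}(t) = -\sum_{j \neq h} \Delta\hat{A}_{hj}(t)$. The sketch below is pure Python, and its input format (counts already aggregated per event time) is hypothetical. The landmark version of Putter and Spitoni applies the same product only to subjects observed in the starting state at the landmark time. The hybrid estimator proposed above is not implemented here.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mult(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def aalen_johansen(n_states, event_steps):
    """Aalen-Johansen estimate of P(s, t) as a product of (I + dA) matrices.

    event_steps: list of (transitions, at_risk) pairs, one per distinct event
    time in (s, t], in increasing time order. transitions[h][j] counts observed
    h -> j moves at that time; at_risk[h] is the number in state h just before it.
    """
    P = identity(n_states)
    for transitions, at_risk in event_steps:
        step = identity(n_states)
        for h in range(n_states):
            if at_risk[h] == 0:
                continue  # nobody in state h, so no hazard increment
            for j in range(n_states):
                if j != h and transitions[h][j]:
                    increment = transitions[h][j] / at_risk[h]
                    step[h][j] += increment  # Nelson-Aalen increment for h -> j
                    step[h][h] -= increment  # keeps each row of I + dA summing to 1
        P = mat_mult(P, step)
    return P

# Two-state "alive -> dead" check, where the estimator reduces to Kaplan-Meier.
# Time 1: 10 at risk, 2 deaths; time 2: 5 at risk (after censoring), 1 death.
steps = [([[0, 2], [0, 0]], [10, 0]),
         ([[0, 1], [0, 0]], [5, 0])]
P = aalen_johansen(2, steps)
print(P[0])  # probabilities of being alive / dead, starting from state 0
```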
Coronavirus disease 2019 (COVID-19) triggered a worldwide pandemic, and transportation services have played a key role in coronavirus transmission. Bike-sharing users are not crowded into a confined space as they would be in a bus or metro car, but they are still exposed to shared bike surfaces and therefore to transmission risk. During the COVID-19 pandemic, meeting user demand while avoiding virus spread became an important issue for bike sharing. Based on bike-sharing trip data from Nanjing, China, this study analyzes travel demand and operation management before and after the pandemic outbreak from the perspectives of stations, users, and bikes. A semi-logarithmic difference-in-differences model, visualization methods, and statistical indexes are applied to explore the transportation service and risk prevention of bike sharing during the pandemic. The results show that pandemic control strategies sharply reduced user demand, and commuting trips decreased more significantly. Some stations near health and religious facilities became more important. Men and older adults are more dependent on bike-sharing systems. In addition, the decrease in trips reduced user contact and increased the number of idle bikes. A new concept, user distancing, is proposed to avoid transmission risk and activate idle bikes. This study evaluates the role of shared micro-mobility during the COVID-19 pandemic and also offers insight into blocking viral transmission within cities.
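The abstract applies a semi-logarithmic difference-in-differences (DiD) model but does not give its specification. In the simplest 2x2 case, the DiD estimate on the log scale is the change in mean log demand for the exposed group minus the change for a comparison group. The percentage effect is then $100(e^{\hat\delta}-1)$. The sketch assumes a hypothetical design, for example comparing daily trips around the 2020 outbreak with the same calendar window a year earlier. The numbers are invented. A regression with station or user fixed effects would generalize this calculation.

```python
import math
from statistics import mean

def semilog_did(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """2x2 difference-in-differences on log outcomes.

    Returns (delta, pct): delta is the DiD estimate on the log scale and
    pct = 100 * (exp(delta) - 1) is the implied percentage change.
    All inputs are lists of positive outcomes (e.g. daily trip counts).
    """
    def mean_log(values):
        return mean([math.log(v) for v in values])

    delta = ((mean_log(treat_post) - mean_log(treat_pre))
             - (mean_log(ctrl_post) - mean_log(ctrl_pre)))
    return delta, 100 * (math.exp(delta) - 1)

# Invented daily trip counts for illustration only
delta, pct = semilog_did(treat_pre=[100, 100], treat_post=[50, 50],
                         ctrl_pre=[100, 100], ctrl_post=[80, 80])
print(round(pct, 1))
```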
The progesterone receptor (PR) mediates progesterone regulation of female reproductive physiology, as well as gene transcription in non-reproductive tissues, such as brain, bone, lung and vasculature, in both women and men. An unusual property of progesterone is its high affinity for the mineralocorticoid receptor (MR), which regulates electrolyte transport in the kidney in humans and other terrestrial vertebrates. In humans, rats, alligators and frogs, progesterone antagonizes activation of the MR by aldosterone, the physiological mineralocorticoid in terrestrial vertebrates. In contrast, in elephant shark, ray-finned fishes and chickens, progesterone activates the MR. Interestingly, cartilaginous fishes and ray-finned fishes do not synthesize aldosterone, raising the question of which steroid(s) activate the MR in cartilaginous fishes and ray-finned fishes. The simpler synthesis of progesterone, compared to cortisol and other corticosteroids, makes progesterone a candidate physiological activator of the MR in elephant sharks and ray-finned fishes. Elephant shark and ray-finned fish MRs are expressed in diverse tissues, including heart, brain and lung, as well as ovary and testis, two reproductive tissues that are targets for progesterone. Together, these findings suggest a multi-faceted physiological role for progesterone activation of the MR in elephant shark and ray-finned fish. The functional consequences of progesterone acting as an antagonist of some terrestrial vertebrate MRs and as an agonist of fish and chicken MRs are not fully understood. Indeed, little is known about the physiological activities of progesterone via any vertebrate MR.
Suppose you are told that taking a statin will reduce your risk of a heart attack or stroke by 3% in the next ten years, or that women have better emotional intelligence than men. You may wonder how accurate the 3% is, or how confident we should be about the assertion about women's emotional intelligence, bearing in mind that these conclusions are based only on samples of data. My aim here is to present two statistical approaches to questions like these. Approach 1 is often called null hypothesis testing, but I prefer the phrase "baseline hypothesis". It is the standard approach in many areas of inquiry but is fraught with problems. Approach 2 can be viewed as a generalisation of the idea of confidence intervals, or as the application of Bayes' theorem. Unlike Approach 1, Approach 2 provides a tentative estimate of the probability of hypotheses of interest. For both approaches, I explain from first principles how to derive answers and the rationale behind them, building only on "common sense" statistical concepts like averages and randomness. This is achieved using computer simulation methods (resampling and bootstrapping, with a spreadsheet available on the web) that avoid the use of probability distributions (t, normal, etc.). Such a minimalist but reasonably rigorous analysis is particularly useful in a discipline like statistics, which is widely used by non-specialists. My intended audience includes both statisticians and users of statistical methods who are not statistical experts.
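The author works with a spreadsheet. As an assumed Python equivalent (not the author's material), a percentile bootstrap interval for a difference in group means works in three steps. Resample each group with replacement many times, compute the difference in means for each resample, and read off the middle 95% of those differences. The scores in the usage lines are invented.

```python
import random
from statistics import mean

def bootstrap_diff_ci(sample_a, sample_b, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(sample_a) - mean(sample_b)."""
    rng = random.Random(seed)  # fixed seed so results are reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sizes
        resample_a = rng.choices(sample_a, k=len(sample_a))
        resample_b = rng.choices(sample_b, k=len(sample_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_resamples)]
    upper = diffs[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper

# Invented emotional-intelligence scores for two groups
women = [102, 98, 110, 105, 99, 107, 101, 104]
men = [100, 95, 103, 97, 101, 96, 99, 102]
print(bootstrap_diff_ci(women, men))
```

If the interval excludes zero, that counts against a "no difference" baseline. This sketch does not reproduce the Bayesian reading of Approach 2.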
Shear stress plays an important role in the creation and evolution of atherosclerosis. A key element for in-vivo measurements and extrapolations is the dependence of shear stress on body mass. Modeling blood flow as Poiseuille flow, P. Weinberg and C. Ethier have shown that shear stress on the aortic endothelium varies as body mass to the power $-\frac{3}{8}$, and is therefore 20-fold higher in mice than in humans. However, by considering a more physiological combined oscillating Poiseuille + Womersley flow in the aorta, we show that the results differ notably: at larger masses ($M > 10\ \mathrm{kg}$), shear stress varies as body mass to the power $-\frac{1}{8}$, and the human-to-mouse ratio becomes 1:8. The allometry and values of the temporal gradient of shear stress also change: $\partial\tau/\partial t$ varies as $M^{-3/8}$ instead of $M^{-5/8}$ at larger masses, and the 1:150 human-to-mouse ratio becomes 1:61. Lastly, we show that the unsteady component of blood flow does not influence the constant allometry of peak velocity on body mass: $u_{max} \propto M^{0}$. This work extends our knowledge of the dependence of hemodynamic parameters on body mass and paves the way for a more precise extrapolation of in-vivo measurements to humans and larger mammals.
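This quick arithmetic check of the Poiseuille scaling assumes body masses of about 70 kg for a human and 25 g for a mouse; neither value comes from the source. Under those assumptions, the mouse-to-human ratios implied by $M^{-3/8}$ and $M^{-5/8}$ come out at roughly 19.6 and 143, close to the quoted 20-fold and 1:150 figures. The Womersley ratios (1:8 and 1:61) cannot be recovered from a single power law, presumably because the $-\frac{1}{8}$ law is stated only for $M > 10\ \mathrm{kg}$. Applying $-\frac{1}{8}$ over the whole mass range would give a ratio of only about 2.7.

```python
# Assumed typical body masses (not given in the source)
M_HUMAN_KG = 70.0
M_MOUSE_KG = 0.025

def mouse_to_human_ratio(exponent):
    """Ratio q(mouse) / q(human) for a quantity q proportional to M**exponent."""
    return (M_MOUSE_KG / M_HUMAN_KG) ** exponent

shear_ratio = mouse_to_human_ratio(-3 / 8)      # Poiseuille wall shear stress
gradient_ratio = mouse_to_human_ratio(-5 / 8)   # Poiseuille d(tau)/dt
naive_womersley = mouse_to_human_ratio(-1 / 8)  # -1/8 law wrongly applied over the full range

print(f"{shear_ratio:.1f} {gradient_ratio:.0f} {naive_womersley:.1f}")
```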
Claire Meyniel, Dalila Samri, Farah Stefano et al.
We evaluated the cognitive status of visually impaired patients referred to low vision rehabilitation (LVR) based on a standard cognitive battery and a new evaluation tool, named the COGEVIS, which can be used to assess patients with severe visual deficits. We studied patients aged 60 and above, referred to the LVR Hospital in Paris. Neurological and cognitive evaluations were performed in an expert memory center. Thirty-eight individuals (17 women and 21 men) with a mean age of 70.3 $\pm$ 1.3 years and a mean visual acuity of 0.12 $\pm$ 0.02 were recruited over a one-year period. Sixty-three percent of participants had normal cognitive status. Cognitive impairment was diagnosed in 37.5% of participants. The COGEVIS score cutoff point to screen for cognitive impairment was 24 (maximum score of 30), with a sensitivity of 66.7% and a specificity of 95%. Evaluation following 4 months of visual rehabilitation showed an improvement in Instrumental Activities of Daily Living (p = 0.004), the National Eye Institute Visual Functioning Questionnaire (p = 0.035), and the Montgomery-Åsberg Depression Rating Scale (p = 0.037). This study introduces a new short test to screen for cognitive impairment in visually impaired patients.
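Sensitivity and specificity at a cutoff follow directly from a 2x2 screening table. The sketch assumes, without support from the source, that a COGEVIS score at or below the cutoff flags possible impairment. The scores and diagnoses are invented toy data, not study data.

```python
def screening_metrics(scores, impaired, cutoff):
    """Return (sensitivity, specificity) of the screening rule `score <= cutoff`.

    The direction of the rule (low score = flagged) is an assumption.
    `impaired` holds the reference diagnosis for each patient.
    """
    tp = fn = tn = fp = 0
    for score, has_impairment in zip(scores, impaired):
        flagged = score <= cutoff
        if has_impairment:
            if flagged:
                tp += 1
            else:
                fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)  # impaired patients correctly flagged
    specificity = tn / (tn + fp)  # unimpaired patients correctly not flagged
    return sensitivity, specificity

# Toy data (hypothetical): three impaired and three unimpaired patients
scores = [20, 23, 26, 28, 24, 29]
impaired = [True, True, True, False, False, False]
print(screening_metrics(scores, impaired, cutoff=24))
```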
Adrienne Traxler, Rachel Henderson, John Stewart et al.
Research on the test structure of the Force Concept Inventory (FCI) has largely ignored gender, and research on FCI gender effects (often reported as "gender gaps") has seldom interrogated the structure of the test. These rarely crossed streams of research leave open the possibility that the FCI may not be structurally valid across genders, particularly since many reported results come from calculus-based courses where 75% or more of the students are men. We examine the FCI considering both psychometrics and gender disaggregation (while acknowledging this as a binary simplification), and find several problematic questions whose removal decreases the apparent gender gap. We analyze three samples (total $N_{pre}=5,391$, $N_{post}=5,769$), looking for gender asymmetries using Classical Test Theory, Item Response Theory, and Differential Item Functioning. The combination of these methods highlights six items that appear substantially unfair to women and two items biased in favor of women. No single physical concept or prior experience unifies these questions, but they are broadly consistent with problematic items identified in previous research. Removing all significantly gender-unfair items halves the gender gap in the main sample of this study. We recommend that instructors using the FCI report the reduced-instrument score as well as the 30-item score, and that they not assign credit or other benefits to students based on the biased items.
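The abstract names Differential Item Functioning but does not say which statistic was used. One widely used DIF statistic is the Mantel-Haenszel common odds ratio; it is shown here as an assumed illustration, not the authors' exact procedure. Students are stratified by total score, and within each stratum a 2x2 table of group by item correctness is formed. Values near 1 indicate no DIF, and values above 1 mean the item favors the reference group among students of equal total score. The choice of reference group is up to the analyst, and the toy data are invented.

```python
from collections import defaultdict

def mantel_haenszel_or(records):
    """Mantel-Haenszel common odds ratio for one test item.

    records: iterable of (stratum, in_reference_group, answered_correctly),
    where the stratum is typically the student's total test score.
    """
    # Per stratum: [A, B, C, D] = [ref correct, ref wrong, focal correct, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for stratum, is_ref, correct in records:
        cell = (0 if correct else 1) if is_ref else (2 if correct else 3)
        tables[stratum][cell] += 1
    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no ref-wrong/focal-correct pairs")
    return numerator / denominator

def make_stratum(score, ref_right, ref_wrong, focal_right, focal_wrong):
    """Expand cell counts into individual (stratum, is_ref, correct) records."""
    return ([(score, True, True)] * ref_right + [(score, True, False)] * ref_wrong
            + [(score, False, True)] * focal_right + [(score, False, False)] * focal_wrong)

# Two hypothetical total-score strata with the same 2x2 pattern
records = make_stratum(15, 8, 2, 5, 5) + make_stratum(20, 8, 2, 5, 5)
print(mantel_haenszel_or(records))
```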
In recent years, the striking gender imbalance in the physical sciences has been the subject of much debate. National bodies and professional societies in the astronomical and space sciences are now taking active steps to understand and address this imbalance. To begin this process in the Australian Space Research community, we must first understand the current state of play. In this work, we therefore present a short 'snapshot' of the current gender balance in our community, as observed at the 15th Australian Space Research Conference. We find that, at this year's conference, male attendees outnumbered female attendees by a ratio of 3:1 (24% female). This gender balance was repeated in the distribution of conference talks and plenary presentations (25% and 22% female, respectively). Of the thirteen posters presented at the conference, twelve were presented by men (92%), a pattern repeated in the awards for the best student presentations (seven male recipients vs one female). The program and organising committees for the meeting fairly represented the gender balance of the conference attendees (28% and 30% female, respectively). These figures provide a baseline for monitoring future progress in increasing the participation of women in the field. They also suggest that the real barrier lies not in speaking opportunities but in enabling conference attendance and retaining female scientists throughout their careers; in other words, in addressing and repairing the 'leaky pipeline'.