We present a novel approach for taxonomic analysis of chloroplast genomes in angiosperms using the Pan-genome Research Toolkit (PGR-TK). Comparative plots generated by PGR-TK across diverse angiosperm genera reveal a wide range of structural complexity, from straightforward to highly intricate patterns. Notably, the characteristic quadripartite plastome structure, comprising the large single copy (LSC), small single copy (SSC), and inverted repeat (IR) regions, is clearly identifiable in over 75% of the genera analyzed. Our findings also underscore several occurrences of species mis-annotations in public genomic databases, which are readily detected through visual anomalies in the PGR-TK plots. While more complex plot patterns remain difficult to interpret, they likely reflect underlying biological variation or technical inconsistencies in genome assembly. Overall, this approach effectively integrates classical botanical visualization with modern molecular taxonomy, providing a powerful tool for genome-based classification in plant systematics.
Smoking addiction remains a significant public health issue in Indonesia, and accurate measurement of smoking behavior is essential to support effective interventions. The Glover–Nilsson Smoking Behaviour Questionnaire (GN-SBQ) is designed to assess psychological and behavioral aspects of tobacco dependence but has not yet been validated in an Indonesian-language version. This study aimed to examine the validity and reliability of the Indonesian version of the GN-SBQ as a tool for measuring behavioral smoking dependence. A quantitative cross-sectional study was conducted with 102 active smokers aged ≥40 years. Cultural adaptation of the GN-SBQ involved forward-backward translation, pre-testing, and cognitive interviews. Content validity was evaluated using the Content Validity Index (CVI), construct validity was assessed through exploratory factor analysis (EFA), and internal consistency was measured using Cronbach’s alpha. All items demonstrated I-CVI values ≥ 0.8, and the overall scale S-CVI/Ave was 0.93. The Kaiser-Meyer-Olkin value was 0.860, and Bartlett’s Test of Sphericity was significant (p < 0.001), confirming that the data were suitable for factor analysis. EFA identified two main components (habitual smoking behavior and emotional/situational response) that together explained 63% of the total variance. All items had factor loadings above 0.4. The overall Cronbach’s alpha coefficient was 0.767, indicating acceptable internal consistency. The Indonesian version of the GN-SBQ is a valid and reliable instrument for assessing behavioral aspects of smoking dependence. It is suitable for use in clinical practice and research settings in Indonesia and can support more comprehensive and culturally relevant smoking cessation interventions.
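The internal-consistency statistic reported above can be illustrated with a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the toy response matrix are assumptions for illustration only; they are not the study's data or analysis code.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Uses sample variances (n - 1 denominator) for both items and totals.
    """
    k = len(responses[0])
    # Variance of each item (column) across respondents.
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    # Variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical toy data: three respondents, three perfectly consistent items.
toy_responses = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(toy_responses)
```

With perfectly consistent items, as in the toy matrix, the statistic reaches its maximum of 1; real questionnaire data give lower values, such as the 0.767 reported in the abstract.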
Parthasarathi Sahu, Sashikanta Barik, Koushik Ghosh
et al.
Locations of DNA replication initiation in prokaryotes, called "origins of replication", are well-characterized. However, a mechanistic understanding of the sequence-dependence of the local unzipping of double-stranded DNA, the first step towards replication initiation, is lacking. Here, utilizing a Markov chain model that was created to address the directional nature of DNA unzipping and replication, we model the sequence dependence of local melting of double-stranded linear DNA segments. We show that generalized palindromic sequences with high nucleotide skews have a low kinetic barrier for local melting near melting temperatures. This allows for such sequences to function as replication origins. We support our claim with evidence for high-skew palindromic sequences within the replication origins of mitochondrial DNA, bacteria, archaea and plasmids.
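As a rough illustration of what a nucleotide skew measures, the sketch below computes the commonly used ratio (G - C)/(G + C), and the analogous ratio for any other pair of bases, over a whole sequence or over sliding windows. This is an assumption about the general notion of skew; the authors' exact definition of skew and of "generalized palindromic" sequences is not given in the abstract and is not reproduced here. The names `nucleotide_skew` and `sliding_skew` are hypothetical.

```python
def nucleotide_skew(sequence, first="G", second="C"):
    """Return (n_first - n_second) / (n_first + n_second) for one strand.

    Returns 0.0 when neither base occurs, to avoid division by zero.
    """
    seq = sequence.upper()
    n_first = seq.count(first)
    n_second = seq.count(second)
    if n_first + n_second == 0:
        return 0.0
    return (n_first - n_second) / (n_first + n_second)


def sliding_skew(sequence, window, step=1, first="G", second="C"):
    """Skew in consecutive windows; a sign change marks a skew switch."""
    return [
        nucleotide_skew(sequence[start:start + window], first, second)
        for start in range(0, len(sequence) - window + 1, step)
    ]
```

For example, `sliding_skew(seq, window=1000, step=100)` would give a skew profile along a genome segment held in the string `seq`.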
Sheetal Prakash Silal, Juliet R. C. Pulliam, Gesine Meyer-Rath
et al.
In March 2020 the South African COVID-19 Modelling Consortium was formed to support government planning for COVID-19 cases and related healthcare. Models were developed jointly by local disease modelling groups to estimate cases, resource needs and deaths due to COVID-19. The National COVID-19 Epi Model (NCEM), while initially developed as a deterministic compartmental model of SARS-CoV-2 transmission in the nine provinces of South Africa, was adapted several times over the course of the first wave of infection in response to emerging local data and the changing needs of government. By the end of the first wave, the NCEM had developed into a stochastic, spatially explicit compartmental transmission model to estimate the total and reported incidence of COVID-19 across the 52 districts of South Africa. The model adopted a generalised Susceptible-Exposed-Infectious-Removed structure that accounted for the clinical profile of SARS-CoV-2 (asymptomatic, mild, severe and critical cases) and avenues of treatment access (outpatient, and hospitalisation in non-ICU and ICU wards). Between end-March and early September 2020, the model was updated 11 times, with four key releases, to generate new sets of projections and scenario analyses to be shared with planners in the national and provincial Departments of Health, the National Treasury and other partners. Updates to model structure included finer spatial granularity, limited access to treatment, and the inclusion of behavioural heterogeneity in relation to the adoption of Public Health and Social Measures. These updates were made in response to local data and knowledge and the changing needs of the planners.
The NCEM attempted to incorporate a high level of local data to contextualise the model appropriately for South Africa’s population and health system characteristics. This played a vital role in producing and updating estimates of resource needs, demonstrating the importance of harnessing and developing local modelling capacity.
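To make the Susceptible-Exposed-Infectious-Removed structure concrete, here is a minimal deterministic, single-population sketch integrated with a simple Euler scheme. It is not the NCEM: it has no spatial structure, no stochasticity, no clinical severity classes and no treatment pathways, and the function name `simulate_seir` and all parameter values are hypothetical.

```python
def simulate_seir(population, initial_infectious, beta, sigma, gamma, days, dt=0.1):
    """Deterministic SEIR model; returns one (S, E, I, R) tuple per day.

    beta: transmission rate, sigma: 1 / latent period,
    gamma: 1 / infectious period (all per day, illustrative only).
    """
    s = float(population - initial_infectious)
    e = 0.0
    i = float(initial_infectious)
    r = 0.0
    trajectory = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt   # S -> E
            new_infectious = sigma * e * dt                # E -> I
            new_removed = gamma * i * dt                   # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        trajectory.append((s, e, i, r))
    return trajectory


# Hypothetical parameters, chosen only to produce an epidemic curve.
trajectory = simulate_seir(
    population=1_000_000, initial_infectious=10,
    beta=0.4, sigma=0.2, gamma=0.1, days=100,
)
```

A model of the kind described above would replace the single I compartment with severity-specific compartments and couple many such populations across districts.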
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on two single-cell transcriptomic datasets.
Daniel Mas Montserrat, Arvind Kumar, Carlos Bustamante
et al.
With declining sequencing costs, a promising and affordable tool is emerging in cancer diagnostics: genomics. By using association studies, genomic variants that predispose patients to specific cancers can be identified, while by using tumor genomics, cancer types can be characterized for targeted treatment. However, a severe disparity is rapidly emerging in this new area of precision cancer diagnosis and treatment planning, one which separates a few genetically well-characterized populations (predominantly European) from all other global populations. Here we discuss the problem of population-specific genetic associations, which is driving this disparity, and present a novel solution, coordinate-based local ancestry, for helping to address it. We demonstrate our boosting-based method on whole-genome data from divergent groups across Africa and in the process observe signals that may stem from the transcontinental Bantu expansion.
Giampaolo Cristadoro, Mirko Degli Esposti, Eduardo G. Altmann
Genetic sequences are known to possess non-trivial composition together with symmetries in the frequencies of their components. Recently, it has been shown that symmetry and structure are hierarchically intertwined in DNA, suggesting a common origin for both features. However, the mechanism leading to this relationship is unknown. Here we investigate a biologically motivated dynamics for the evolution of genetic sequences. We show that a metastable (long-lived) regime emerges in which sequences have symmetry and structure interlaced in a way that matches that of extant genomes.
Maddalena Dilucca, Giulio Cimini, Sergio Forcelloni
et al.
We study the correlation between the codon usage bias of genetic sequences and the network features of protein-protein interaction (PPI) in bacterial species. We use PCA techniques in the space of codon bias indices to show that genes with similar patterns of codon usage have a significantly higher probability that their encoded proteins are functionally connected and interacting. Importantly, this signal emerges when multiple aspects of codon bias are taken into account at the same time. The present study extends our previous observations on E. coli to a wide set of 34 bacteria. These findings could allow for future investigations of the possible effects of codon bias on the topology of the PPI network, with the aim of improving existing bioinformatics methods for predicting protein interactions.
Quentin Marcou, Thierry Mora, Aleksandra M Walczak
High-throughput immune repertoire sequencing promises to lead to new statistical diagnostic tools for medicine and biology. Successful implementations of these methods require a correct characterization, analysis and interpretation of these datasets. We present IGoR, a new comprehensive tool that takes B- or T-cell receptor sequence reads and quantitatively characterizes the statistics of receptor generation from both cDNA and gDNA. It probabilistically annotates sequences, and its modular structure can be used to investigate models of increasing biological complexity for different organisms. For B cells, IGoR returns the hypermutation statistics, which we use to reveal co-localization of hypermutations along the sequence. We demonstrate that IGoR outperforms existing tools in accuracy and estimate the sample sizes needed for reliable repertoire characterization.
A new approach to estimate the Shannon entropy of a long-range correlated sequence is proposed. The entropy is written as the sum of two terms corresponding respectively to power-law (ordered) and exponentially (disordered) distributed blocks (clusters). The approach is illustrated on the 24 human chromosome sequences by taking the nucleotide composition as the relevant information to be encoded/decoded. Interestingly, the nucleotide composition of the ordered clusters is found, on average, to be comparable to that of the whole analyzed sequence, while that of the disordered clusters fluctuates. From the information theory standpoint, this means that the power-law correlated clusters carry the same information as the whole analyzed sequence. Furthermore, the fluctuations of the nucleotide composition of the disordered clusters are linked to relevant biological properties, such as segmental duplications and gene density.
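For orientation, the quantity being estimated can be illustrated with the plain Shannon entropy of a sequence's nucleotide composition, H = -sum over bases of p log2 p. The sketch below is only this baseline composition entropy; it does not implement the cluster-based decomposition into ordered and disordered terms described above, and the function name `composition_entropy` is hypothetical.

```python
import math
from collections import Counter


def composition_entropy(sequence):
    """Shannon entropy (bits per symbol) of the symbol frequencies in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )
```

A sequence using the four bases equally often gives 2 bits per nucleotide, the maximum for a four-letter alphabet, while a homopolymer gives 0.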
The UCSC Genome Browser is a popular tool for the exploration and analysis of reference genomes. Mirrors of the UCSC Genome Browser and its contents exist at multiple geographic locations, and this mirror procedure has been modified to support genome sequences not maintained by UCSC and generated by individual researchers. While straightforward, this procedure is lengthy and tedious and would benefit from automation, especially when processing many genome sequences. We present a Unix shell program that facilitates the creation of custom instances of the UCSC Genome Browser for genome sequences not being maintained by UCSC. It automates many steps of the browser creation process, provides password protection for each browser instance, and automates the creation of basic annotation tracks. As an example we generate a custom UCSC Genome Browser for a bacterial genome obtained from a massively parallel sequencing platform.
Phylogenomic approaches developed thus far are either too time-consuming or lack a solid evolutionary basis. Moreover, no phylogenomic approach is capable of constructing a tree directly from unassembled raw sequencing data. A new phylogenomic method, CO-phylum, is developed to alleviate these flaws. CO-phylum can generate a high-resolution and highly accurate tree using complete genomes or unassembled sequencing data of closely related organisms. In addition, the CO-phylum distance is almost linear with p-distance.
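The p-distance used as the reference above is the proportion of aligned sites at which two sequences differ. A minimal sketch follows, assuming two equal-length, already aligned sequences and skipping gap columns; the function name `p_distance` and the gap handling are illustrative assumptions, and the CO-phylum distance itself is not implemented here.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == gap or base_b == gap:
            continue
        compared += 1
        if base_a != base_b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared
```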
We study the dynamical properties of small regulatory networks treated as non-autonomous dynamical systems, called modules, when working inside larger networks or, equivalently, when subject to external signal inputs. Particular emphasis is put on the interplay between the internal properties of the open systems and the different possible inputs on them to deduce new functionalities of the modules. We use discrete-time, piecewise-affine and piecewise-contracting models with interactions of a regulatory nature to perform our study.
The classification of life should be based upon the fundamental mechanism in the evolution of life. We found that the global relationships among species should be circular phylogeny, which is quite different from the common sense based upon phylogenetic trees. The genealogical circles can be observed clearly according to the analysis of protein length distributions of contemporary species. Thus, we suggest that domains can be defined by distinguished phylogenetic circles, which are global and stable characteristics of living systems. The mechanism in genome size evolution has been clarified; hence main component questions on C-value enigma can be explained. According to the correlations and quasi-periodicity of protein length distributions, we can also classify life into three domains.
Natively unfolded proteins exist as an ensemble of flexible conformations lacking a well-defined tertiary structure along a large portion of their polypeptide chain. Despite the absence of a stable configuration, they are involved in important cellular processes. In this work we used three indicators of folding status, derived from the analysis of mean packing and mean contact energy of a protein sequence as well as from VSL2, a disorder predictor, and we combined them into a consensus score to identify natively unfolded proteins in several genomes from Archaea, Bacteria and Eukarya. We found a high correlation between the number of predicted natively unfolded proteins and the number of proteins in the genomes. More specifically, the number of natively unfolded proteins scaled with the number of proteins in the genomes, with exponent 1.81 ± 0.10. This scaling law may be important to understand the relation between the number of natively unfolded proteins and their roles in cellular processes.
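A scaling exponent of this kind is typically estimated by a straight-line fit in log-log space, since y = c * x^a implies log y = log c + a log x. The sketch below shows an ordinary least-squares version of that fit. The fitting procedure actually used in the study is not stated in the abstract, the function name `fit_power_law` is hypothetical, and the data points are synthetic, generated from an exact power law purely to exercise the code.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = prefactor * x**exponent in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor


# Synthetic, noise-free data (NOT measurements): proteome sizes and counts
# generated from an exact power law with exponent 1.81.
proteome_sizes = [500, 1000, 2000, 4000, 8000]
unfolded_counts = [0.001 * size ** 1.81 for size in proteome_sizes]
exponent, prefactor = fit_power_law(proteome_sizes, unfolded_counts)
```

On real data the fit would also need an uncertainty estimate for the exponent, which this sketch omits.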
This work is concerned with the development of a well-founded, theoretically justified, and least complicated metric for the classification of proteins with reference to enzymes. As the signature of an enzyme family, a catalytic domain is easily fingerprinted. Given that the classification problem has so far seemed intractable, a classification schema derived from the catalytic domain would be satisfying. Here I show that there exists a natural ab initio if nonobvious basis to theorize that the catalytic domain of an enzyme is uniquely informative about its regulation. This annotates its function. Based on this hypothesis, a method that correctly classifies potassium ion channels into their respective subfamilies is described. To put the principle on firmer ground, extra validation was sought and obtained through co-evolutionary analyses. The co-evolutionary analyses reveal a departure from the notion that potassium ion channel proteins are functionally modular. This finding is discussed in light of the prevailing notion of domain. These studies establish that significant co-evolution of the catalytic domain of a gene with its conjoint domain is a specialized, necessary process following fusion and swapping events in evolution. Instances of this discovery are likely to be found pervasive in protein science.
We report on a theoretical study of the effects of point mutations on charge transfer properties in the DNA sequence of the tumor-suppressor p53 gene. On the basis of effective single-strand or double-strand tight-binding models which simulate hole propagation along the DNA, a statistical analysis of charge transmission modulations associated with all possible point mutations is performed. We find that, in contrast to non-cancerous mutations, mutation hotspots tend to result in significantly weaker changes of transmission properties. This suggests that charge transport could play a significant role in the DNA-repair deficiency that leads to carcinogenesis.
Recombination is an important event in the evolution of HIV. It affects the global spread of the pandemic as well as evolutionary escape from host immune response and from drug therapy within single patients. Comprehensive computational methods are needed for detecting recombinant sequences in large databases, and for inferring the parental sequences. We present a hidden Markov model to annotate a query sequence as a recombinant of a given set of aligned sequences. Parametric inference is used to determine all optimal annotations for all parameters of the model. We show that the inferred annotations recover most features of established hand-curated annotations. Thus, parametric analysis of the hidden Markov model is feasible for HIV full-length genomes, and it improves the detection and annotation of recombinant forms. All computational results, reference alignments, and C++ source code are available at http://bio.math.berkeley.edu/recombination/.
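The idea of annotating a query as a mosaic of aligned reference sequences can be sketched with a small "jumping" hidden Markov model decoded by the Viterbi algorithm: each hidden state is one reference, the emission probability rewards matching the reference at the current alignment column, and a transition either stays on the same reference or jumps to another. This is a simplified illustration at one fixed parameter setting, written in Python rather than the authors' C++; it does not perform the parametric inference over all parameter values described above. The function name `annotate_recombinant` and the default probabilities are hypothetical.

```python
import math


def annotate_recombinant(query, references, mismatch=0.05, jump=0.05):
    """Viterbi path of parental labels for a query aligned to the references.

    references: dict mapping a label to a sequence of the same length as query.
    mismatch: total probability of emitting a base that differs from the parent.
    jump: total probability of switching parent between adjacent columns.
    """
    names = list(references)
    log_match = math.log(1 - mismatch)
    log_mismatch = math.log(mismatch / 3)  # spread over the three other bases
    log_stay = math.log(1 - jump)
    log_jump = math.log(jump / max(len(names) - 1, 1))

    def log_emit(name, pos):
        return log_match if references[name][pos] == query[pos] else log_mismatch

    # Uniform prior over parents at the first column.
    score = {name: -math.log(len(names)) + log_emit(name, 0) for name in names}
    backpointers = []
    for pos in range(1, len(query)):
        new_score, pointers = {}, {}
        for name in names:
            best_prev = max(
                names,
                key=lambda prev: score[prev] + (log_stay if prev == name else log_jump),
            )
            transition = log_stay if best_prev == name else log_jump
            new_score[name] = score[best_prev] + transition + log_emit(name, pos)
            pointers[name] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back the most probable sequence of parents.
    state = max(names, key=lambda name: score[name])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

For instance, with toy references `{"A": "AAAAAAAA", "B": "CCCCCCCC"}` and query `"AAAACCCC"`, the decoded path assigns the first four columns to A and the last four to B, with a single breakpoint in the middle.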
Most proteins involved in processing DNA accomplish their activities as a monomer or as a component of a multimer containing a relatively small number of other elements. They generally act locally, binding to one or a few small regions of the DNA substrate. Striking exceptions are the E. coli protein RecA and its homologues in other species, whose activities are associated with homologous DNA recombination. The active form of RecA in DNA recombination is a stiff nucleoprotein filament formed by RecA and DNA, within which the DNA is extended by 50%. Invoking physical and geometrical ideas, we show that the filamentary organization greatly enhances the rate of homologous recognition while preventing the formation of topological traps originating from multi-site recognition.