Trending Topics on Science, a Tensor Memory Hypothesis Approach
Abstract
Current human knowledge is written. Documentation is the most common way to preserve memories and to store remarkable stories. To distinguish reality from fiction, scientific writing cites previous works in addition to reporting experimental setups. Books and scientific papers are only a small part of the existing literature, but they are considered more trustworthy as information sources. Using information about the authors and the keywords in titles and abstracts, it is possible to find further relations and to know where to focus the search on a topic. This can be done with relational databases or knowledge graphs, a semantic approach; the tensor memory hypothesis, which adds a temporal dimension, makes it possible to process the information with an episodic memory approach instead. Although knowledge graphs are widely used in question answering and chatbots, they need a prior relational schema, generated automatically or by hand and stored in an easy-to-query file format. I use JATS, a standard format that allows integrating scientific papers into semantic searches but is not adopted by all scientific publishers, to extract the markup tags from PDF files (current-year journal articles on one particular topic), and then construct the tensor memory with their references to extract relations and predictions with statistical relational learning techniques.

Introduction

Memory is defined as the ability to record information and to recall it afterwards. Writing is a human invention that facilitates this capacity, in particular for declarative memories, which are facts or events that can be expressed with language and can be of two types: semantic or episodic (Tresp et al., 2017). The memories and knowledge of humanity are stored in written documents, which gain reliability when they include references to previous works by other authors.
Scientific articles are a model of well-structured presentation and storage of information, each with its own title, explicit authorship, and references to related information in other documents or within the same document. What is almost always relevant when deciding whether to read them, the retrieval action, is their publishing year. Their ordered structure therefore makes it possible to use them as a representation of global human episodic knowledge and memories. Scientific publication as a human activity can also be modeled as a social network. From this kind of network the expression "trending topic" emerged, naming the most frequent term or word used in a specific temporal window; it is understood as the principal theme or main subject related to the information described in a piece of content. In a mathematical and computational framework, semantic memories can be represented as knowledge graphs, where entities are nodes and links are relations between them. A relation between entities can then be defined as a triple (s, p, o), i.e., a simple subject-predicate-object sentence. An episodic memory adds a time marker, so a temporal prepositional phrase extends the simple sentence, subject-predicate-object-temporal_preposition, giving a quad (s, p, o, t). This approach is widely used in semantic web technologies under the Linked Data methodology (Bizer et al., 2011). It is therefore plausible to use complex network analysis tools to search for the most relevant relations between authors, paper titles or keywords. Scientific publication databases can easily contain millions of authors, papers and their respective citations. From a specific topic query, a reduced number of relevant documents is expected, not the thousands of results that search engines like Google Scholar or publishers' own engines can generate for a given string of words.

Proceedings of the 4th Congress on Robotics and Neuroscience
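The quad representation described above can be made concrete as a fourth-order adjacency tensor. The sketch below uses hypothetical toy quads (the entity names and predicates are illustrative, not taken from the paper's dataset) and builds a binary tensor X[s, p, o, t]:

```python
import numpy as np

# Hypothetical toy data: (subject, predicate, object, year) quads, following
# the (s, p, o, t) episodic representation described above.
quads = [
    ("author_a", "wrote", "paper_1", 2018),
    ("author_a", "wrote", "paper_2", 2019),
    ("paper_1", "mentions", "sleep", 2018),
    ("paper_2", "mentions", "memory", 2019),
]

def build_index(items):
    """Map each distinct item to an integer index."""
    return {item: i for i, item in enumerate(sorted(set(items)))}

subjects = build_index(q[0] for q in quads)
predicates = build_index(q[1] for q in quads)
objects = build_index(q[2] for q in quads)
years = build_index(q[3] for q in quads)

# Fourth-order, unweighted adjacency tensor: X[s, p, o, t] = 1 iff the quad
# (s, p, o, t) is present in the data.
X = np.zeros((len(subjects), len(predicates), len(objects), len(years)))
for s, p, o, t in quads:
    X[subjects[s], predicates[p], objects[o], years[t]] = 1.0

print(X.shape, int(X.sum()))
```

In practice the tensor is extremely sparse, so a coordinate list of quads (as above) is stored and the dense form is only materialized, if ever, for the decomposition step.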
The field of the science of science studies these relations, and former works were carried out using knowledge graphs, which are expressed as adjacency matrices. If the temporal dimension and various types of relationships are considered, it is possible to form tensors of fourth order. A matrix X of the network can be bipartite (X ∈ R^(n×m)) if there are two types of nodes (authors-articles, authors-words, articles-words) or monopartite (X ∈ R^(n×n)); unweighted (x_ij ∈ {0, 1}) or weighted (x_ij ∈ R); directed or undirected (X^T = X) (Zeng et al., 2017). (Tresp and Ma, 2017) introduced the Tensor Memory Hypothesis, where a knowledge graph is represented by a Tucker decomposition of the tensors. It is based on representational learning, i.e., a discrete entity e is associated with a vector of real numbers a_e called latent variables. (Tresp and Ma, 2017) also argue that representational learning might be the basis for perception, planning and decision making. From a physiological point of view, there is evidence that the hippocampus plays a central role in the temporal organization of memories and supports the disambiguation of overlapping episodes (Eichenbaum, 2014a). In the standard consolidation theory of memory (SCT), episodic memory is a neocortical representation that arises from hippocampal activity, while in the multiple trace theory (MTT) episodic memory is represented only in the hippocampus and is used to form semantic memories in the neocortex. There is also evidence of the existence of "place cells" and "time cells" in the hippocampus, and that these support associative networks that represent spatiotemporal relations between the entities of memories (Eichenbaum, 2014b).

Table 1. PCA variance for the number of latent components.
Latent components   PCA variance (%)
3                   2.93
5                   4.3
10                  7.32
15                  10.03
20                  12.5
25                  14.8
50                  24.99
100                 41.88
200                 63.9

There are some previous works on trending or hot topics in science: (Griffiths and Steyvers, 2004) used Latent Dirichlet Allocation (LDA) to analyze the abstracts from the Proceedings of the National Academy of Sciences (PNAS) from 1991 to 2001. (Wei et al., 2013) performed a statistical analysis to find out whether scientists follow hot topics in their investigations, using published papers from the American Physical Society (APS) Physical Review journals from 1976 to 2009. (Kang and Lin, 2018) used non-smooth non-negative matrix factorization (nsNMF) to extract the most prominent topics from a dataset of keywords of scientific articles related to "Machine Learning" from 2014 to 2016 in arXiv.org stat.ML; the similarity of that work with the Tensor Memory Hypothesis lies in the use of matrix decomposition to reduce the rank of the matrix. (Alshareef et al., 2018) used indexes based on cosine similarity to estimate a score that represents the anticipation of a prospective relationship between authors; they used two subsets of the IEEE digital library containing the keywords "database" and "multimedia".

Results

The quantity of latent components is not associated with a specific statistical measure of the data. However, as a point of reference, table 1 presents the corresponding percentage of variance if the same number of PCA components were employed.

Table 2. Most probable words for the query with an entity type.
Latent components   Authors                Articles                Words
3                   neuromodulation        neuromodulation         neuromodulation
5                   stimulus, presented    stimulus, presented     stimulus, technique
10                  presented              presented               presented
15                  sleep, memory          sleep                   sleep
20                  stimulus, memory       stimulus, cued          stimulus, cued
25                  memory, sws            memory, spatial, sws    memory, sws
50                  sleep, stimulus        sleep, stimulus         sleep, stimulus
100                 assr, memory           assr, memory            assr, memory
200                 wireless, monitoring   sleep, slow             sleep, slow

The words with the most relations in the complete tensor, before decomposition, are sleep, memory, stimulation, slow, brain, consolidation, auditory, spindles, reactivation, and activity. Table 2 is populated by selecting the most frequent word from queries of the type

word_i = argmax_o { P(s, o, t) },   (1)

where s is each author, paper title or word in the database, o is a word, t is a year, and i is the index of an entity. The most probable words obtained from these queries are more numerous with many latent components than with a few latent variables. For example, there are 21 different words in the query results using 200 latent components; on the other hand, for few latent components, the query results contain only the words shown in table 2. Table 3 is populated using the NMF decomposition of the time-collapsed matrix, obtained by adding the weights of each year.

Table 3. Most probable words with NMF decomposition.

Latent components   Authors                    Articles                  Words
3                   slow, sleep, auditory      stimulation, sleep        sleep, memory
5                   spindles, auditory, sleep  sleep                     sleep
10                  sleep, stimulation         sleep, stimulation        sleep, memory
15                  sleep, memory              brain, consolidation      sleep, memory
20                  sleep, memory              oscillations, sleep       sleep, memory
25                  sleep, stimulation         activity, memory          sleep, memory
50                  sleep, memory              oscillations, humans      sleep, memory
100                 sleep, role                reactivation, slow-wave   sleep, memory
200                 sleep, slow                sleep, brain              sleep, memory
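The query in equation (1) can be sketched as follows. A random tensor stands in for the reconstructed score tensor P(s, o, t) (in the paper these scores come from the decomposed tensor memory; the word list and tensor values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical score tensor P[s, o, t]: s indexes subject entities (authors,
# paper titles or words), o indexes candidate words, t indexes years.
words = ["sleep", "memory", "stimulation", "slow", "brain"]
P = rng.random((10, len(words), 3))  # 10 subjects, 5 words, 3 years

def most_probable_word(P, s, t):
    """Equation (1): the word o maximizing P(s, o, t) for a subject s and year t."""
    return words[int(np.argmax(P[s, :, t]))]

# One answer per (subject, year) pair; a table like table 2 then keeps the
# most frequent of these answers for each number of latent components.
answers = [most_probable_word(P, s, t)
           for s in range(P.shape[0])
           for t in range(P.shape[2])]
print(answers[:5])
```

The selection strategy for the tables reduces to counting how often each word wins this argmax across all subjects and years.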
The most frequent words are selected from those that are maximal for each topic, i.e., for each k-th row of the matrix H of the decompositions. The same processing using the nsNMF decomposition yields the words sleep and memory as the most probable in all cases. The analysis of relationships between entities needs a distance metric. Each entity is represented by latent vectors, so one possible choice is the Euclidean distance; however, for this particular type of data, content from documents, the metric usually employed is the cosine similarity. Computing distances in the original data space demands high computational costs; a reduced space alleviates the cost of calculating distances, but at the previously paid cost of the space transformation. Figure 1 is an example of the Euclidean
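Both metrics mentioned above are straightforward to compute on latent vectors. A minimal sketch, with hypothetical latent vectors standing in for rows of the factor matrices obtained from the decomposition:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two latent vectors (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two latent vectors."""
    return float(np.linalg.norm(a - b))

# Hypothetical latent vectors for two entities; the values are illustrative.
a_sleep = np.array([0.9, 0.1, 0.4])
a_memory = np.array([0.8, 0.2, 0.5])

print(cosine_similarity(a_sleep, a_memory))   # close to 1: related entities
print(euclidean_distance(a_sleep, a_memory))
```

Cosine similarity ignores vector magnitude, which is why it is preferred for document-derived features, where frequency scaling varies between entities.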
Author: Felipe Torres
Year published: 2019
Language: en
Source database: Semantic Scholar
DOI: 10.21428/16E4EE64
Access: Open Access