Co-citation based data augmentation for contrastive learning of scientific domains
Data compilation
We used co-citations as a similarity heuristic to generate sufficiently large training datasets for contrastive learning over scientific domains. Our strategy produces large training datasets from small amounts of data because of the nonlinear scaling of citation graphs: a single paper citing N other papers produces $\binom{N}{2} = \frac{N(N-1)}{2}$ co-citation pairs. For context, a dataset of 10,000 individual papers can produce well over 125,000 co-citation pairs. While this measure of similarity is not perfect, co-citations have generally been shown to imply a high degree of similarity between papers21. For modeling purposes, we assume that two co-cited papers are more similar to each other than two randomly chosen papers, even when the random pair comes from the same field.
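To make this scaling concrete, the minimal pure-Python sketch below enumerates the $\binom{N}{2}$ pairs contributed by each citing paper; the `reference_lists` input is a hypothetical mapping from a citing paper's ID to the IDs of the papers it cites, not the paper's actual data format.

```python
from itertools import combinations
from collections import Counter
from math import comb

def co_citation_pairs(reference_lists):
    """Count how often each unordered pair of papers is co-cited.

    reference_lists: dict mapping a citing paper ID -> list of cited paper IDs.
    A paper citing N others contributes comb(N, 2) co-citation pairs.
    """
    pair_counts = Counter()
    for cited in reference_lists.values():
        # Unordered pairs, so (a, b) and (b, a) share one key.
        for a, b in combinations(sorted(set(cited)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Example: one paper citing 5 others yields comb(5, 2) == 10 co-citation pairs.
demo = {"paper_X": ["p1", "p2", "p3", "p4", "p5"]}
assert sum(co_citation_pairs(demo).values()) == comb(5, 2)
```

The per-pair counts returned here are also what a frequency-weighted sampling scheme, like the one described below, would draw from.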
To build our dataset, we randomly chose five biomedical subfields with little overlap: papers related to cardiovascular disease (CVD), chronic obstructive pulmonary disease (COPD), parasitic diseases, autoimmune diseases, and skin cancers. PubMed Central was queried with Medical Subject Heading (MeSH) terms for each domain, restricted to papers published between 2010 and 2022 that had at least one citation and an available abstract. Within this window, we kept only those of the possible $\binom{N}{2}$ co-citation pairs per citing paper in which both papers were returned by the same MeSH query. When constructing the final dataset, we preferentially sampled pairs that were co-cited more frequently.
For evaluation, we constructed “negative” examples of abstract pairs that were not co-cited. The training dataset was split randomly in a 99:1 ratio, followed by deduplication. Negative pairs were built by pairing abstracts that had not been co-cited and had each been cited at least 15 times. These criteria allowed us to construct a representative, class-balanced evaluation set for binary classification, labeling co-cited pairs as 1 and non-co-cited pairs as 0. The exact dataset counts are outlined in Table 1.
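A rough sketch of how such a balanced evaluation set could be assembled, assuming hypothetical inputs `co_cited_pairs` (the positive pairs) and `citation_counts` (used to enforce the at-least-15-citations requirement for negatives); this is an illustration, not the authors' exact pipeline.

```python
import random

def build_eval_pairs(co_cited_pairs, citation_counts, seed=0):
    """Build a balanced binary-classification set: 1 = co-cited, 0 = not co-cited.

    co_cited_pairs: set of frozensets {paper_a, paper_b} that were co-cited.
    citation_counts: dict mapping paper ID -> number of times it has been cited.
    """
    rng = random.Random(seed)
    positives = [(tuple(p), 1) for p in co_cited_pairs]

    # Candidate negatives: papers cited at least 15 times.
    eligible = [p for p, c in citation_counts.items() if c >= 15]
    negatives = []
    while len(negatives) < len(positives):
        a, b = rng.sample(eligible, 2)
        if frozenset((a, b)) not in co_cited_pairs:
            negatives.append(((a, b), 0))

    pairs = positives + negatives
    rng.shuffle(pairs)
    return pairs
```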
Transformer neural networks
The transformer architecture is adept at sequential processing and is state-of-the-art for various natural language processing (NLP) and vision tasks24,25,26,27,28,29,30. A transformer block comprises a self-attention layer and a multi-layer perceptron (MLP), interleaved with skip connections. A full transformer consists of T transformer blocks stacked together1.
Prior to the transformer blocks is the token embedding process, where tokenization maps an input string into a list of L integers drawn from a dictionary. These integers serve as row indices into a matrix $W_e \in \mathbb{R}^{v \times d}$, where each row is a learnable representative vector for that token, v is the total number of unique tokens in the vocabulary, and d is an arbitrarily chosen hidden dimension. The initial embedding is therefore $X^{(0)} \in \mathbb{R}^{L \times d}$.
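A small NumPy sketch of this lookup, with placeholder dimensions and token IDs (the tokenizer itself is not shown and the values are illustrative only):

```python
import numpy as np

v, d = 30_522, 768                 # toy vocabulary size and hidden dimension
rng = np.random.default_rng(0)
W_e = rng.normal(scale=0.02, size=(v, d))   # learnable embedding matrix, R^{v x d}

# Tokenization maps an input string to L integer indices (placeholder IDs here).
token_ids = np.array([101, 7592, 2088, 102])   # length L = 4

# Row lookup produces the initial embedding X^(0) in R^{L x d}.
X0 = W_e[token_ids]
print(X0.shape)                    # (4, 768)
```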
Each block in the transformer then transforms this embedding, i.e., the $i^{\text{th}}$ transformer block maps the embedding $X^{(i-1)} = [x_1^{(i-1)}, \ldots, x_L^{(i-1)}]^\top \in \mathbb{R}^{L \times d}$ to $X^{(i)} = [x_1^{(i)}, \ldots, x_L^{(i)}]^\top \in \mathbb{R}^{L \times d}$1,31,32. $X^{(T)}$ is the last hidden state of the network. The first part of this map is self-attention, which mixes information across the L token vectors, followed by the MLP, which mixes information across the hidden dimension d31,33.
Including the MLP, and omitting layer normalization for brevity, the entire transformer block can be written as:

$$\tilde{X} = X^{(i-1)} + \mathrm{SelfAttention}\big(X^{(i-1)}\big), \qquad X^{(i)} = \tilde{X} + \sigma\big(\tilde{X} W_1 + b_1\big) W_2 + b_2,$$

where $b_1$ and $b_2$ are the biases of the learned linear transformations $W_1 \in \mathbb{R}^{d \times I}$ and $W_2 \in \mathbb{R}^{I \times d}$, with $I > d$. The activation function $\sigma$, e.g., ReLU or GeLU, introduces non-linearity1. More recently, biases are often omitted, which has been reported to improve training stability, throughput, and final performance. Improvements such as SwiGLU activation functions and rotary positional embeddings are also commonly used3,4,34,35.
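The following single-head NumPy sketch illustrates one such block with random placeholder weights; multi-head attention and layer normalization are omitted, so it is a schematic of the equations above rather than a production implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(X, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """One block: maps X in R^{L x d} to R^{L x d}."""
    d = X.shape[1]
    # Self-attention mixes information across the L token positions.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d)) @ V
    X = X + A @ Wo                       # skip connection around attention
    # MLP mixes information across the hidden dimension d.
    H = np.maximum(X @ W1 + b1, 0.0)     # sigma = ReLU
    return X + H @ W2 + b2               # skip connection around the MLP

L, d, I = 4, 8, 32                       # I > d is the MLP's inner dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
W1, b1 = rng.normal(size=(d, I)), np.zeros(I)
W2, b2 = rng.normal(size=(I, d)), np.zeros(d)
X = rng.normal(size=(L, d))
print(transformer_block(X, Wq, Wk, Wv, Wo, W1, b1, W2, b2).shape)  # (4, 8)
```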
GPT (Generative Pretrained Transformer) models, such as OpenAI’s GPT series (GPT-3, GPT-4, etc.), are designed for generative tasks and use transformer decoders36,37,38. They employ causal (unidirectional) attention, meaning each token attends only to previous tokens in the sequence, enabling autoregressive generation during inference. This allows them to predict the next word in a sequence without direct access to future words.
In contrast, BERT models utilize transformer encoders with bidirectional attention, meaning they can attend to all tokens within an input simultaneously. This structure enables them to capture additional contextual dependencies, making them well-suited for tasks like text classification and sentence similarity39. Unlike GPT models, BERT is trained using a masked language modeling (MLM) objective, where some tokens are randomly hidden, requiring the model to predict them based on the surrounding context.
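The contrast between the two attention patterns can be made explicit with a small NumPy sketch of the respective masks (1 = position may be attended to, 0 = masked); the masks themselves are generic, not tied to any particular GPT or BERT implementation.

```python
import numpy as np

L = 5  # sequence length

# GPT-style causal mask: token i attends only to positions <= i.
causal_mask = np.tril(np.ones((L, L), dtype=int))

# BERT-style bidirectional mask: every token attends to every token.
bidirectional_mask = np.ones((L, L), dtype=int)

print(causal_mask)
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```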
Mixture of Experts
Mixture of