Co-citation based data augmentation for contrastive learning of scientific domains
2025/04/29 14:08
Data compilation
We used co-citations as a similarity heuristic to generate sufficiently large training datasets for contrastive learning over scientific domains. Our strategy enabled the production of large training datasets from small amounts of data due to the nonlinear scaling of citation graphs: a single paper citing N other papers produces (N choose 2) = N(N−1)/2 co-citation pairs. For context, a dataset of 10,000 individual papers can produce well over 125,000 co-citation pairs. While this measure of similarity is not perfect, co-citations have generally been shown to imply a high degree of similarity between papers21. For our modeling purposes, we assume that two co-cited papers are more similar than two random papers, even if the two random papers come from the same field.
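To make the scaling concrete, the short Python sketch below expands per-paper reference lists into unordered co-citation pairs; the function and variable names (`cocitation_pairs`, `reference_lists`) are illustrative rather than taken from the paper.

```python
from itertools import combinations

def cocitation_pairs(reference_lists):
    """Expand each paper's reference list into unordered co-citation pairs.

    A paper citing N references yields N*(N-1)/2 pairs, which is why a
    modest corpus of citing papers produces a much larger pair dataset.
    """
    pair_counts = {}
    for refs in reference_lists:
        for a, b in combinations(sorted(set(refs)), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
    return pair_counts

# Toy example: one citing paper with 5 references -> 10 co-citation pairs.
example = [["p1", "p2", "p3", "p4", "p5"]]
print(len(cocitation_pairs(example)))  # 10
```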
To build our dataset, we randomly chose five biomedical subfields with little overlap: papers related to cardiovascular disease (CVD), chronic obstructive pulmonary disease (COPD), parasitic diseases, autoimmune diseases, and skin cancers. PubMed Central was queried with Medical Subject Heading (MeSH) terms for each domain, requiring at least one citation and an abstract, for papers published between 2010 and 2022. Within this time period, of the possible (N choose 2) co-citations per citing paper, we kept only the pairs in which both papers were returned by the same MeSH query. When constructing the final dataset, we sampled preferentially from pairs that were co-cited more often. A retrieval sketch is shown below.
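The paper does not specify its retrieval tooling. A minimal sketch using Biopython's Entrez E-utilities is given below; it queries the PubMed index rather than PubMed Central directly, and the query string, helper name (`fetch_domain_ids`), and email placeholder are assumptions for illustration.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact email; placeholder

def fetch_domain_ids(mesh_term, retmax=10000):
    """Search for papers tagged with a MeSH term, having an abstract,
    published between 2010 and 2022. Returns a list of record ids."""
    query = (
        f'"{mesh_term}"[MeSH Terms] AND hasabstract[text] '
        'AND ("2010/01/01"[PDAT] : "2022/12/31"[PDAT])'
    )
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

cvd_ids = fetch_domain_ids("Cardiovascular Diseases")
```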
For evaluation, we constructed “negative” examples of abstract pairs that were not co-cited. The training dataset was split randomly in a 99:1 ratio, followed by deduplication. We built negative pairs by pairing abstracts that had not been co-cited and had each been cited at least 15 times. These criteria allowed us to construct a representative, class-balanced evaluation set for binary classification, labeling co-cited pairs as 1 and non-co-cited pairs as 0. The exact dataset counts are outlined in Table 1.
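The following sketch illustrates one way to assemble such a balanced evaluation set under the stated constraints (never co-cited, each member cited at least 15 times); the names `build_eval_pairs`, `cocited`, and `citation_counts` are hypothetical.

```python
import random

def build_eval_pairs(cocited, citation_counts, n_pairs, min_citations=15, seed=0):
    """Balanced binary evaluation set: label 1 for co-cited pairs, label 0 for
    pairs that were never co-cited and whose members were each cited at least
    `min_citations` times."""
    rng = random.Random(seed)
    positives = rng.sample(sorted(cocited), n_pairs)

    eligible = [p for p, c in citation_counts.items() if c >= min_citations]
    negatives = []
    while len(negatives) < n_pairs:
        a, b = rng.sample(eligible, 2)
        pair = tuple(sorted((a, b)))
        if pair not in cocited:
            negatives.append(pair)

    return [(p, 1) for p in positives] + [(p, 0) for p in negatives]
```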
Transformer neural networks
The transformer architecture is adept at sequential processing and is state-of-the-art for various natural language processing (NLP) and vision tasks24,25,26,27,28,29,30. A transformer block comprises a self-attention layer and a multi-layer perceptron (MLP), interleaved with skip connections. A full transformer consists of T transformer blocks stacked together1.
Prior to the transformer blocks is the token embedding process, where tokenization maps an input string into a list of L integers drawn from a dictionary. These integers serve as row indices into a matrix W_e ∈ R^{v×d}, where each row is a learnable vector representing that token, v is the number of unique tokens in the vocabulary, and d is an arbitrarily chosen hidden dimension. The resulting initial embedding is X^{(0)} ∈ R^{L×d}.
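A minimal PyTorch sketch of this lookup is shown below; the vocabulary size, hidden dimension, and token ids are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

v, d = 30522, 768          # vocabulary size and hidden dimension (illustrative)
W_e = nn.Embedding(v, d)   # rows of W_e are learnable token vectors, W_e in R^{v x d}

token_ids = torch.tensor([[101, 7592, 2088, 102]])  # a tokenized string of L = 4 ids (made up)
X0 = W_e(token_ids)        # initial embedding, shape (batch, L, d) -> here (1, 4, 768)
print(X0.shape)
```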
Each block in the transformer then transforms this embedding, i.e., the i-th transformer block maps the embedding X^{(i−1)} = [x_1^{(i−1)}, ..., x_L^{(i−1)}]^⊤ ∈ R^{L×d} to X^{(i)} = [x_1^{(i)}, ..., x_L^{(i)}]^⊤ ∈ R^{L×d}1,31,32. X^{(T)} is the last hidden state of the network. The first part of this map is self-attention, which mixes information across the L token vectors, followed by the MLP, which mixes information across the d hidden dimensions31,33.
Including the MLP, the entire transformer block can be written as a self-attention sub-layer followed by a position-wise MLP of the form MLP(x) = σ(xW_1 + b_1)W_2 + b_2, with each sub-layer wrapped in a skip connection. Here b_1 and b_2 are the biases of the learned linear transformations W_1 ∈ R^{d×I} and W_2 ∈ R^{I×d}, with I > d, and the activation function σ, e.g., ReLU or GeLU, introduces non-linearity1. More recently, biases are often omitted, which improves training stability, throughput, and final performance. Improvements such as SwiGLU activation functions and rotary positional embeddings are also commonly used3,4,34,35.
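A compact PyTorch sketch of such a block follows. It assumes a pre-norm arrangement with layer normalization before each sub-layer; the excerpt does not state the exact normalization placement, and the head count and expansion factor are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: self-attention and an MLP, each wrapped in a skip connection."""

    def __init__(self, d: int, n_heads: int = 8, expansion: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        I = expansion * d                      # inner width I > d
        self.mlp = nn.Sequential(
            nn.Linear(d, I),                   # W1 (and b1)
            nn.GELU(),                         # activation sigma
            nn.Linear(I, d),                   # W2 (and b2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # mix across the L positions
        x = x + self.mlp(self.norm2(x))                     # mix across the d dimensions
        return x

x = torch.randn(1, 16, 256)                  # (batch, L, d)
print(TransformerBlock(256)(x).shape)        # torch.Size([1, 16, 256])
```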
GPT (Generative Pretrained Transformer) models, such as OpenAI’s GPT series (GPT-3, GPT-4, etc.), are designed for generative tasks and use transformer decoders36,37,38. They employ causal (unidirectional) attention, meaning each token attends only to previous tokens in the sequence, enabling autoregressive generation during inference. This allows them to predict the next word in a sequence without direct access to future words.
In contrast, BERT models utilize transformer encoders with bidirectional attention, meaning they can attend to all tokens within an input simultaneously. This structure enables them to capture additional contextual dependencies, making them well-suited for tasks like text classification and sentence similarity39. Unlike GPT models, BERT is trained using a masked language modeling (MLM) objective, where some tokens are randomly hidden, requiring the model to predict them based on the surrounding context.
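To make the distinction between the two attention patterns and training objectives concrete, the short sketch below builds a causal mask, a bidirectional mask, and a randomly masked MLM input; the 15% masking rate and the [MASK] token id 103 follow common BERT convention rather than anything stated in this excerpt.

```python
import torch

L = 6

# GPT-style causal mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(L, L, dtype=torch.bool))

# BERT-style bidirectional attention: every position attends to every other.
bidirectional_mask = torch.ones(L, L, dtype=torch.bool)

# BERT's MLM objective hides a random subset of tokens (15% is conventional)
# and asks the model to predict them from the surrounding context.
token_ids = torch.randint(1000, 30000, (1, L))
mlm_positions = torch.rand(1, L) < 0.15
masked_ids = token_ids.masked_fill(mlm_positions, 103)  # 103 = [MASK] in standard BERT vocab
```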
Mixture of Experts