Identifying the Clients Associated with Legal Documents

2024/11/19 05:02

The main objective is to identify the client associated with each document by one of several identifiers.

The goal was to extract client names from legal documents using Named Entity Recognition (NER). Here's how I approached the task:

Data: I had a collection of legal documents in PDF format. The task was to identify the clients mentioned in each document using one of the following identifiers:

Approximate client name (e.g., "John Doe")

Precise client name (e.g., "Doe, John A.")

Approximate firm name (e.g., "Doe Law Firm")

Precise firm name (e.g., "Doe, John A. Law Firm")

About 5% of the documents didn't include any identifying entities.

Dataset: For developing the model, I used 710 "true" PDF documents, which were split into three sets: 600 for training, 55 for validation, and 55 for testing.
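
The split described above can be sketched as follows. This is a minimal illustration, assuming a seeded random shuffle (the text does not say how the documents were assigned to each set); the function name `split_documents` and the file names are hypothetical.

```python
import random


def split_documents(paths, n_train=600, n_val=55, n_test=55, seed=0):
    """Shuffle the document paths and cut them into train/val/test sets.

    The random shuffle and the seed are assumptions made for illustration;
    only the set sizes (600/55/55) come from the text.
    """
    if len(paths) != n_train + n_val + n_test:
        raise ValueError("split sizes must sum to the number of documents")
    shuffled = list(paths)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)       # reproducible shuffle
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 710 PDFs.
paths = [f"doc_{i:03d}.pdf" for i in range(710)]
train, val, test = split_documents(paths)
```

Fixing the seed keeps the validation and test sets stable across runs, so scores from different experiments stay comparable.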

Labels: I was given an Excel file with entities extracted as plain text, which needed to be manually labeled in the document text. Using the BIO tagging format, I performed the following steps:

Mark the beginning of an entity with "B-".

Continue marking subsequent tokens within the same entity with "I-".

If a token doesn't belong to any entity, mark it as "O".
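
The three rules above can be sketched in a few lines. This is only an illustration, assuming whitespace tokenization and exact token matching between the entity text and the document; the label name `CLIENT` and the function `bio_tags` are hypothetical and not taken from the project.

```python
def bio_tags(tokens, entity_tokens, label="CLIENT"):
    """Return one BIO tag per token, marking each exact match of the entity."""
    tags = ["O"] * len(tokens)                  # default: token is outside any entity
    n = len(entity_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == entity_tokens:
            tags[start] = f"B-{label}"          # first token of the entity
            for i in range(start + 1, start + n):
                tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = "This agreement is made with John Doe of Springfield".split()
tags = bio_tags(tokens, ["John", "Doe"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here "John" receives `B-CLIENT`, "Doe" receives `I-CLIENT`, and every other token receives `O`. Real documents would likely need fuzzier matching, since the entity text in the Excel file may not match the PDF text exactly.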

Alternative Approach: Models like LayoutLM, which also consider bounding boxes for input tokens, could potentially enhance the performance of the NER task. However, I opted not to use this approach because, as is often the case, I had already spent the majority of the project time on preparing the data (e.g., reformatting Excel files, correcting data errors, labeling). To integrate bounding box-based models, I would have needed to allocate even more time.

While regex and heuristics could theoretically be applied to identify these simple entities, I anticipated that this approach would be impractical, as it would necessitate overly complex rules to precisely identify the correct entities among other potential candidates (e.g., lawyer name, case number, other participants in the proceedings). In contrast, the model is capable of learning to distinguish the relevant entities, rendering the use of heuristics superfluous.
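
A small sketch shows why a simple pattern falls short. The regular expression and the sample sentence below are made up for illustration; they are not rules that were tried in the project.

```python
import re

# Naive pattern: two capitalized words in a row, e.g. "John Doe".
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

# Invented sample text with three different people in three different roles.
text = "Counsel: Jane Roe. Client: John Doe. Opposing party: Richard Miles."

candidates = NAME_PATTERN.findall(text)
print(candidates)
```

The pattern returns the lawyer, the client, and the opposing party alike. Picking out the client would require further rules about the surrounding context, and those rules would have to cover every layout the documents use.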
