The goal was to extract client names from legal documents using Named Entity Recognition (NER). Here's how I approached the task:
Data: I had a collection of legal documents in PDF format. The task was to identify the clients mentioned in each document using one of the following identifiers:
Approximate client name (e.g., "John Doe")
Precise client name (e.g., "Doe, John A.")
Approximate firm name (e.g., "Doe Law Firm")
Precise firm name (e.g., "Doe, John A. Law Firm")
About 5% of the documents didn't include any identifying entities.
Dataset: For developing the model, I used 710 "true" PDF documents, which were split into three sets: 600 for training, 55 for validation, and 55 for testing.
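As a rough illustration of the 600/55/55 split, here is a minimal sketch. The post does not say how the documents were assigned to each set, so the seeded random shuffle, the function name `split_documents`, and the file names are all assumptions made for this example.

```python
import random


def split_documents(paths, n_train=600, n_val=55, n_test=55, seed=0):
    """Split a list of document paths into train/validation/test sets.

    Assumption: a seeded random shuffle; the original post does not
    describe how documents were assigned to each set.
    """
    if len(paths) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of documents")
    shuffled = list(paths)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 710 PDF documents.
docs = [f"doc_{i:03d}.pdf" for i in range(710)]
train, val, test = split_documents(docs)
```

Fixing the seed keeps the split reproducible, so validation and test documents never leak into training between runs.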
Labels: I was given an Excel file with entities extracted as plain text, which needed to be manually labeled in the document text. Using the BIO tagging format, I performed the following steps:
Mark the first token of an entity with "B-" followed by the entity label.
Mark each subsequent token within the same entity with "I-" followed by the same label.
If a token doesn't belong to any entity, mark it as "O".
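The labeling steps above can be sketched in a few lines. This is only an illustration: the whitespace tokenization, the exact-match lookup, the helper name `bio_tag`, and the label `CLIENT` are assumptions, not details taken from the project.

```python
def bio_tag(tokens, entity_tokens, label):
    """Return one BIO tag per token, marking every exact occurrence
    of `entity_tokens` inside `tokens`."""
    tags = ["O"] * len(tokens)  # default: token is outside any entity
    n = len(entity_tokens)
    if n == 0:
        return tags
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == entity_tokens:
            tags[i] = f"B-{label}"  # first token of the entity
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"  # remaining tokens of the entity
            i += n
        else:
            i += 1
    return tags


# Assumption: simple whitespace tokenization and a hypothetical CLIENT label.
tokens = "The client John Doe signed the agreement".split()
tags = bio_tag(tokens, ["John", "Doe"], "CLIENT")
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Exact matching only works when the entity string from the spreadsheet appears verbatim in the document text. Names that differ in spelling or formatting would still need the manual correction described above.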
Alternative Approach: Models like LayoutLM, which also consider bounding boxes for input tokens, could potentially enhance the performance of the NER task. However, I opted not to use this approach because, as is often the case, I had already spent the majority of the project time on preparing the data (e.g., reformatting Excel files, correcting data errors, labeling). To integrate bounding box-based models, I would have needed to allocate even more time.
While regex and heuristics could theoretically be applied to identify these simple entities, I anticipated that this approach would be impractical, as it would necessitate overly complex rules to precisely identify the correct entities among other potential candidates (e.g., lawyer name, case number, other participants in the proceedings). In contrast, the model is capable of learning to distinguish the relevant entities, rendering the use of heuristics superfluous.
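To make the point about competing candidates concrete, here is a small sketch of a naive name pattern. The regular expression and the sample sentence are invented for illustration and are not taken from the project data.

```python
import re

# Naive assumption: a person's name is two capitalized words in a row.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

# Invented sentence containing both a client and a lawyer.
text = "John Doe, represented by attorney Jane Roe, filed the motion."

candidates = NAME_PATTERN.findall(text)
print(candidates)  # ['John Doe', 'Jane Roe']
```

The pattern finds both names but cannot tell which one is the client. Separating them would take extra rules about the surrounding context, which is the kind of complexity a trained model is expected to absorb.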