The goal was to extract client names from legal documents using Named Entity Recognition (NER). Here's how I approached the task:
Data: I had a collection of legal documents in PDF format. The task was to identify the clients mentioned in each document using one of the following identifiers:
Approximate client name (e.g., "John Doe")
Precise client name (e.g., "Doe, John A.")
Approximate firm name (e.g., "Doe Law Firm")
Precise firm name (e.g., "Doe, John A. Law Firm")
About 5% of the documents didn't include any identifying entities.
Dataset: For developing the model, I used 710 "true" PDF documents, which were split into three sets: 600 for training, 55 for validation, and 55 for testing.
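The 600/55/55 split can be reproduced with a few lines of Python. This is only a sketch: the write-up does not say how documents were assigned to each set, so the seeded random shuffle, the function name `split_documents`, and the placeholder file names are all assumptions.

```python
import random


def split_documents(paths, n_train=600, n_val=55, n_test=55, seed=0):
    """Shuffle document paths and cut them into train/validation/test lists."""
    if len(paths) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of documents")
    shuffled = list(paths)
    # A fixed seed keeps the split reproducible between runs (assumed choice).
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 710 PDFs.
docs = [f"doc_{i:03d}.pdf" for i in range(710)]
train, val, test = split_documents(docs)
print(len(train), len(val), len(test))
```

Splitting at the document level, rather than by page or sentence, keeps all text from one PDF in a single set.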
Labels: I was given an Excel file with entities extracted as plain text, which needed to be manually labeled in the document text. Using the BIO tagging format, I performed the following steps:
Mark the first token of an entity with the "B-" prefix.
Mark each subsequent token within the same entity with the "I-" prefix.
If a token doesn't belong to any entity, mark it as "O".
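To make the three tagging rules concrete, here is a minimal sketch that projects one known entity string onto a tokenized sentence. The labeling in the project was done manually, so this is an illustration, not the method used. The label name `CLIENT`, the whitespace tokenization, the sample sentence, and the helper `bio_tags` are all assumptions.

```python
def bio_tags(tokens, entity_tokens, label="CLIENT"):
    """Return one BIO tag per token, tagging every exact occurrence of entity_tokens."""
    tags = ["O"] * len(tokens)  # default: token belongs to no entity
    n = len(entity_tokens)
    if n == 0:
        return tags
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == entity_tokens:
            tags[i] = f"B-{label}"  # first token of the entity
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"  # remaining tokens of the same entity
            i += n
        else:
            i += 1
    return tags


# Invented example sentence; real documents would be tokenized by the model's tokenizer.
tokens = "The client John Doe signed the agreement .".split()
tags = bio_tags(tokens, "John Doe".split())
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Exact token matching only works when the entity text from the spreadsheet appears verbatim in the document; approximate names would need fuzzy matching or manual correction.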
Alternative Approach: Models like LayoutLM, which also consider bounding boxes for input tokens, could potentially enhance the performance of the NER task. However, I opted not to use this approach because, as is often the case, I had already spent the majority of the project time on preparing the data (e.g., reformatting Excel files, correcting data errors, labeling). To integrate bounding box-based models, I would have needed to allocate even more time.
While regex and heuristics could theoretically be applied to identify these simple entities, I anticipated that this approach would be impractical, as it would necessitate overly complex rules to precisely identify the correct entities among other potential candidates (e.g., lawyer name, case number, other participants in the proceedings). In contrast, the model is capable of learning to distinguish the relevant entities, rendering the use of heuristics superfluous.
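A small sketch of the ambiguity problem: a naive pattern for runs of capitalized words finds every name-like span, with nothing to say which one is the client. The sentence and all names in it are invented for illustration, and the pattern is a deliberately simple assumption, not a rule that was tried in the project.

```python
import re

# Naive pattern: two or more consecutive capitalized words.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

# Invented sentence with a lawyer, a client, an opposing party, and a judge.
text = (
    "Attorney Jane Roe filed the motion on behalf of John Doe "
    "against Acme Holdings in the matter before Judge Mary Major."
)

candidates = NAME_PATTERN.findall(text)
for candidate in candidates:
    print(candidate)
```

The pattern returns four candidates here, only one of which is the client. Separating them with rules would mean encoding context such as "on behalf of" for every phrasing that occurs in the documents, which is the context a trained model can learn from labeled examples.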