Detecting relevant segment of text in legal domain

Basic information

Grant number: 499514-2016
Principal investigator: Makrehchi, Masoud
Amount: approximately $18,200
Host institution: University of Ontario Institute of Technology
Host institution country: Canada
Program: Engage Grants Program
Fiscal year: 2016
Funding country: Canada
Duration: 2016-01-01 to 2017-12-31
Project status: Completed

Source: https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=607502
Keywords: Detecting relevant segment text legal

Project abstract

The goal of the research is to investigate, design, and implement algorithms that detect (or recognize) and extract the relevant segment of text, predict and recognize legal entities and context, and finally generate appropriate metadata to be stored in a structured database. The database can be used in several scenarios, from user queries to legal research by law practitioners. A "relevant segment" is defined as a contiguous piece of text that is relevant to the question of interest (or simply, the query). Relevance can be measured by different methods, depending on how relevance is interpreted. If we are looking for the name of a judge in a legal document, we can use a wide range of information extraction (IE) tools. IE draws on a broad spectrum of techniques, ranging from image segmentation (useful when an image of the document is available and the relevant segment is highly expected in a specific zone) to Conditional Random Fields (CRF) and Markov models, and on to machine learning and classification.

While structured pieces of information such as entities can be extracted using IE techniques, the deeper, ambiguous, and conceptual components of a legal document, such as the type of damage or the judge's decision and case outcome, require supervised machine learning algorithms that go beyond IE techniques. This problem is neither a traditional IE problem nor a text classification problem. To solve it, a legal document is partitioned into conceptually related segments such as header, case, citations, damages, decision, and so on. This step is called zoning and can be performed using supervised or unsupervised learning methods. Some zones, such as headers, are expected to appear in the very first section of the document, so they can be detected by unsupervised techniques. Other components, such as "damages", may appear in any part of the document and need a supervised model that uses a lexicon, manually labeled ground truth, or both.
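
The zoning and relevant-segment ideas in the abstract can be illustrated with a toy sketch. It is not the project's method: the abstract names supervised and unsupervised learning, whereas this sketch substitutes simple hand-written heuristics (a positional rule for headers, a keyword lexicon for damages, and a maximum-scoring contiguous run for the relevant segment). The lexicon, the function names, the thresholds, and the sample paragraphs are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical cue terms for the "damages" zone; not taken from the source.
DAMAGES_LEXICON = {"damages", "compensation", "award", "awarded", "loss", "losses"}


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def zone_paragraphs(paragraphs, header_count=1, min_hits=2):
    """Assign a coarse zone label to each paragraph.

    The first `header_count` paragraphs are labelled "header" by position
    alone. Any later paragraph containing at least `min_hits` lexicon
    tokens is labelled "damages". Everything else is labelled "other".
    """
    zones = []
    for index, paragraph in enumerate(paragraphs):
        if index < header_count:
            zones.append("header")
            continue
        hits = sum(1 for token in tokenize(paragraph) if token in DAMAGES_LEXICON)
        zones.append("damages" if hits >= min_hits else "other")
    return zones


def most_relevant_segment(paragraphs, query, penalty=1):
    """Return (start, end) indices of the best contiguous run of paragraphs.

    Each paragraph gains one point per token that matches a query term and
    loses `penalty` points, so runs padded with unrelated paragraphs score
    lower. The best run is found with a maximum-subarray scan. Returns None
    when no run has a positive score.
    """
    query_terms = set(tokenize(query))
    best_score, best_span = 0, None
    run_score, run_start = 0, 0
    for index, paragraph in enumerate(paragraphs):
        matches = sum(1 for token in tokenize(paragraph) if token in query_terms)
        gain = matches - penalty
        if run_score <= 0:
            # A non-positive run cannot help; start a new run here.
            run_score, run_start = gain, index
        else:
            run_score += gain
        if run_score > best_score:
            best_score, best_span = run_score, (run_start, index)
    return best_span


# Made-up sample document, one string per paragraph.
SAMPLE = [
    "Superior Court of Justice, before Justice Smith",
    "The plaintiff slipped on an icy walkway.",
    "The court awarded general damages and compensation for loss of income.",
    "The action is allowed with costs.",
]

print(zone_paragraphs(SAMPLE))
print(most_relevant_segment(SAMPLE, "damages awarded"))
```

In a setting closer to what the abstract describes, the lexicon rule would be replaced or complemented by a classifier trained on manually labeled zones, and the positional header rule by an unsupervised method.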
