Intelligent Information Retrieval Systems for Text Databases of Japanese and Chinese Classics

Basic Information

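Grant number:
22H03903
Principal investigator:
肖川 (Xiao Chuan)
Amount:
$108,200
Host institution:
Osaka University
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research (B)
Fiscal year:
2022
Funding country:
Japan
Project period:
2022-04-01 to 2026-03-31
Project status:
Ongoing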

Source:
https://kaken.nii.ac.jp/en/grant/KAKENHI-PROJECT-22H03903/
Keywords:
information retrieval; Japanese and Chinese classics; database; knowledge base

Project Abstract

This fiscal year, we extracted and integrated classical Chinese (kanbun) named entities from texts of Japanese and Chinese classics. Existing text databases of Japanese and Chinese classics do not adequately support information retrieval. The main reason is that, owing to the grammar of classical Chinese, many of these texts contain aliases and abbreviations of proper nouns, so a search returns only results that exactly match the query keyword. To retrieve results that contain aliases of a proper noun, proper nouns and their aliases must be extracted from the texts in advance. However, unlike modern Chinese, classical Chinese texts often lack punctuation, and handling unpunctuated data is a difficult problem.

To address these issues, we used token-free pre-trained models. The most widely used pre-trained language models to date operate on sequences of tokens that correspond to words or subword units. In contrast, token-free models operate directly on raw text (bytes or characters) and offer several advantages: they can process text in any language, they are more robust to noise, and they eliminate complex, error-prone text preprocessing pipelines. In view of these advantages, we developed a pre-trained language model for classical Chinese based on ByT5, a token-free model, and fine-tuned it for named entity recognition in classical Chinese. The fine-tuned model substantially outperforms existing methods and can even correct errors in the so-called ground truth (C-CLUE).

Initial results were presented at DEIM 2023. Detailed results are to be submitted to EMNLP 2023. In addition, for data integration, we developed a method for identifying semantically equivalent content; these results are to be presented at VLDB 2023.
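
The abstract's point that a token-free model operates directly on raw bytes can be illustrated with a minimal sketch. It is not the project's code. It assumes a ByT5-style ID layout in which IDs 0, 1, and 2 are reserved for padding, end-of-sequence, and unknown, so that each UTF-8 byte `b` maps to ID `b + 3`. The function names and the sample sentence are illustrative choices.

```python
from typing import List

# Assumed ByT5-style layout: IDs 0-2 are reserved (pad, end-of-sequence, unknown),
# so every UTF-8 byte value is shifted up by 3.
BYTE_OFFSET = 3
EOS_ID = 1


def encode(text: str) -> List[int]:
    """Map text to byte-level IDs and append the end-of-sequence ID."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: List[int]) -> str:
    """Invert encode(), skipping the reserved IDs."""
    raw = bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    sample = "子曰學而時習之"  # unpunctuated classical Chinese, no segmentation needed
    ids = encode(sample)
    print(len(sample), "characters ->", len(ids), "IDs")  # 7 characters -> 22 IDs
    print(decode(ids) == sample)  # True
```

No word segmentation, punctuation restoration, or vocabulary lookup is needed before encoding, which is why this approach suits unpunctuated classical Chinese. The trade-off is longer sequences: each of these characters occupies three bytes in UTF-8.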