Development of a Size-Reduction Method for Double-Array Language Models Using Word ID Optimization

(単語ID最適化によるダブル配列言語モデルのサイズ縮小手法の開発)

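Basic Information

Grant number:
22K12162
Principal investigator:
Mikio Yamamoto (山本幹雄)
Amount:
$26,600
Host institution:
University of Tsukuba
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research (C)
Fiscal year:
2022
Funding country:
Japan
Project period:
2022-04-01 to 2026-03-31
Project status:
Ongoing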

Source:
https://kaken.nii.ac.jp/grant/KAKENHI-PROJECT-22K12162/
Keywords:
n-gram language model; trie tree; remapping; linear arrangement problem; trie; hypergraph

Project Abstract

A double-array language model is a compact implementation of an n-gram language model built on a double array, and its distinguishing feature is fast lookup. However, when the model is trained on very large text data, both model size and construction speed deteriorate. The size of a double array depends heavily on how word IDs are assigned (that is, on the column ordering of the trie transition matrix). Based on this fact, the broad goal of this research is to develop word ID assignment methods that reduce the size and construction time of n-gram language models.

In the first year, from among the several methods we plan to propose, we proposed applying to the double array a method called Remapping, which varies word IDs according to the n-gram level, and showed that it improves both size and construction speed. When n-gram word sequences are represented as a trie, Remapping renumbers the word IDs on the branches from each node to its child nodes, narrowing the ID width of those branches (the range of word ID numbers that can appear on a branch). Each word ID in the trie is replaced with an ID that depends on the immediately preceding word (the node one level up in the trie). Because the preceding word restricts which words can follow, the word ID numbers can be confined to a small range. Remapping has previously been used to make string compression more efficient; we proposed using it to make double arrays more efficient.

We evaluated the effect of Remapping on moderately large data, with the number of distinct n-grams ranging from several hundred million to about one billion. Compared with no Remapping, the model size was consistently smaller, with a maximum reduction of about 30%. The reduction also grows as the number of distinct n-grams increases, so a scaling effect can be expected. In addition, construction speed was confirmed to improve, although only slightly.
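
The Remapping idea described in the abstract can be illustrated with a small Python sketch. It is not the project's implementation: the function names, the toy n-grams, and the choice of numbering successors in ascending global-ID order are all assumptions made for illustration, and the double array itself is not built. The sketch only shows how replacing each word's global ID with an ID local to its preceding word narrows the range of IDs on the branches of a trie node.

```python
from collections import defaultdict


def build_remap(ngrams):
    """Build {preceding_word_id: {global_id: local_id}}.

    ngrams: iterable of tuples of global word IDs.
    Local IDs run 0..k-1 over the k distinct words seen after a given
    preceding word (ascending global-ID order is an assumption here).
    """
    successors = defaultdict(set)
    for ngram in ngrams:
        for prev, cur in zip(ngram, ngram[1:]):
            successors[prev].add(cur)
    return {
        prev: {gid: local for local, gid in enumerate(sorted(ids))}
        for prev, ids in successors.items()
    }


def remap_ngram(ngram, remap):
    """Keep the first word's global ID; give every later word the ID
    that is local to its immediately preceding word."""
    rest = tuple(remap[prev][cur] for prev, cur in zip(ngram, ngram[1:]))
    return (ngram[0],) + rest


def id_width(ids):
    """Range of ID numbers spanned by the branches of one trie node."""
    return max(ids) - min(ids) + 1


# Hypothetical toy data: each tuple is an n-gram of global word IDs.
ngrams = [(5, 1000, 42), (5, 1000, 90000), (5, 70000, 42), (7, 1000, 500)]
remap = build_remap(ngrams)

# Children of the trie node reached by the path (5, 1000).
children_global = [g[2] for g in ngrams if g[:2] == (5, 1000)]
children_local = [remap[1000][gid] for gid in children_global]

width_before = id_width(children_global)
width_after = id_width(children_local)
print(width_before, width_after)
```

In this toy case the two children of the node for (5, 1000) carry global IDs 42 and 90000, while their local IDs are 0 and 2, so the branch IDs span a far smaller range after remapping. A lookup would apply the same mapping to the query n-gram before traversing the trie.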