
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing


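Basic Information

Grant number:
RGPIN-2019-05683
Principal investigator:
Keselj, Vlado
Amount:
$20,400
Host institution:
Dalhousie University
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Period:
2022-01-01 to 2023-12-31
Status:
Completed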

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=750271
Keywords:
Harmonizing String Unification based Methodology

Project Abstract

This research proposal aims to advance the state of the art in natural language processing at three levels in order to meet demands for better information processing. At the lowest level, character and word n-gram processing, our objectives are to improve n-gram-based text mining through the use of variable-length n-gram profiles; to develop n-gram-based visual text analytics through visualization of n-gram profiles and corresponding Eulerian graphs; to compare the current CNG distance measure with other measures (e.g., Jaccard, Dice) at a deeper model level; to use Google N-grams data to improve standard language n-gram profiles; and to adapt the Normalized Google Distance to obtain an off-line distance. At the middle level of processing (RegEx-based), we will advance the development of regular expression patterns for directed sentiment analysis and parsing of noisy text by examining ways to generate RegEx-based patterns, generating patterns from Google N-grams data, and extending the Starfish system for text-embedded processing. At the third level, the unification level, our objectives are to transfer the sub-graph isomorphism technique from analysis in the biomedical scientific domain to information gathering from social media, to generate concept semantic relationships from Wikipedia data, and to develop semantic-based visualization of streaming textual data, such as e-mail streams. Our approach is based on previous work at these three levels of language processing: (1) Common N-Gram (CNG) analysis, where text data is modelled using character n-gram profiles; (2) regular-expression-based processing of textual data, which applies RegEx rewriting patterns and matches the data against similar patterns; and (3) at the unification level, information extraction and matching using unification or sub-graph isomorphism, where the structured data itself is generated by parsing with stochastic unification-based grammars.
The novelty and expected significance of the approach lie in improving the methodology to support visual text analysis, i.e., visualization and closer interaction with the user, and in better adapting and developing the methodology for new kinds of textual data and novel applications arising from the expansion of Internet data and social media. The significance of the approach is supported by strong interest from industrial partners in the summarized analysis of social media data.
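
To illustrate the n-gram level described in the abstract, the following is a minimal Python sketch, not the project's implementation. It builds character n-gram profiles and compares two texts using the CNG dissimilarity as it is commonly defined (a sum of squared relative differences between normalized n-gram frequencies), alongside Jaccard and Dice overlap computed on the sets of n-grams in each profile. All function names, the default n-gram length and profile size, and the sample strings are assumptions made for illustration. The last function applies the standard Normalized Google Distance formula to counts taken from a local corpus, which is only one possible reading of an "off-line" distance.

```python
from collections import Counter
from math import log


def ngram_profile(text: str, n: int = 3, size: int = 1000) -> dict[str, float]:
    """Map the `size` most frequent character n-grams to their relative frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {g: c / total for g, c in Counter(grams).most_common(size)}


def cng_dissimilarity(p1: dict[str, float], p2: dict[str, float]) -> float:
    """Sum of squared relative frequency differences over the union of both profiles."""
    total = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        # f1 + f2 > 0 because the n-gram occurs in at least one profile.
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def jaccard(p1: dict[str, float], p2: dict[str, float]) -> float:
    """Jaccard similarity of the n-gram sets (frequencies are ignored)."""
    a, b = set(p1), set(p2)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(p1: dict[str, float], p2: dict[str, float]) -> float:
    """Dice coefficient of the n-gram sets (frequencies are ignored)."""
    a, b = set(p1), set(p2)
    denom = len(a) + len(b)
    return 2 * len(a & b) / denom if denom else 1.0


def normalized_count_distance(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance formula applied to local counts.

    fx, fy: counts of each term; fxy: co-occurrence count; n: corpus size.
    Assumes all counts are positive and n is larger than both fx and fy.
    """
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))


if __name__ == "__main__":
    # Hypothetical sample strings, used only to exercise the functions.
    a = ngram_profile("the quick brown fox jumps over the lazy dog")
    b = ngram_profile("the quick brown cat sleeps near the lazy dog")
    print("CNG dissimilarity:", cng_dissimilarity(a, b))
    print("Jaccard:", jaccard(a, b))
    print("Dice:", dice(a, b))
```

Note that the CNG measure is a dissimilarity (0 for identical profiles, larger for more different ones), whereas Jaccard and Dice as written here are similarities in the range 0 to 1, so a direct comparison needs one of them inverted.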
