Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing


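Basic information

  • Grant number:
    RGPIN-2019-05683
  • Principal investigator:
  • Amount:
    $20,400
  • Host institution:
  • Host institution country:
    Canada
  • Project category:
    Discovery Grants Program - Individual
  • Fiscal year:
    2020
  • Funding country:
    Canada
  • Start and end dates:
    2020-01-01 to 2021-12-31
  • Project status:
    Completed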

Project abstract

This research proposal aims to advance the state of the art in natural language processing at three levels, in order to meet demands for better information processing. At the lowest level, character and word n-gram processing, our objectives are to improve n-gram based text mining through the use of variable-length n-gram profiles; to improve n-gram based visual text analytics through visualization of n-gram profiles and the corresponding Eulerian graphs; to compare the current CNG distance measure with other measures (e.g., Jaccard, Dice) at a deeper model level; to use Google N-grams data to improve the standard language n-gram profiles; and to adapt the Normalized Google Distance to obtain an off-line distance. At the middle level of processing (RegEx based), we will advance the development of regular expression patterns for directed sentiment analysis and parsing of noisy text, examining ways to generate RegEx-based patterns, generating patterns from Google N-grams data, and extending the Starfish system for text-embedded processing. At the third level, the unification level, our objectives are to transfer the sub-graph isomorphism technique from analysis in the biomedical scientific domain to information gathering from social media, to generate concept semantic relationships from Wikipedia data, and to develop semantic-based visualization of stream textual data, such as visualization of e-mail streams.
Our approach builds on previous work at these three levels of language processing: (1) Common N-Gram analysis (CNG), where the text data is modelled using character n-gram profiles; (2) regular expression based processing of textual data, which applies RegEx rewriting patterns and matches the data with similar patterns; and (3) at the unification level, information extraction and matching using unification or sub-graph isomorphism, where the structural data itself is generated by parsing with stochastic unification-based grammars.
The novelty and expected significance of the approach lie in improving the methodology to provide for visual text analysis, i.e., visualization and closer interaction with the user, and in better adapting and developing the methodology for new kinds of textual data and novel applications arising from the expansion of Internet data and social media. The significance of the approach is supported by strong interest from industrial partners in the area of summarized analysis of social media data.
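
As an illustration of the lowest level described in the abstract, the following is a minimal Python sketch of character n-gram profiles and of a comparison between a CNG-style dissimilarity and set-based Jaccard and Dice distances. It is not the project's implementation. The function names, the default n-gram length, and the profile size are assumptions chosen for the example, and the CNG formula follows the commonly cited formulation (squared relative frequency differences summed over the union of two profiles), which the abstract itself does not spell out.

```python
from collections import Counter


def ngram_profile(text, n=3, size=1000):
    """Map the `size` most frequent character n-grams to relative frequencies.

    `n` and `size` are illustrative defaults, not values from the abstract.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in counts.most_common(size)}


def cng_dissimilarity(p1, p2):
    """Sum of squared relative differences over the union of two profiles.

    An n-gram missing from one profile is treated as having frequency 0.
    """
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def jaccard_distance(p1, p2):
    """1 - |A ∩ B| / |A ∪ B| on the n-gram sets, ignoring frequencies."""
    a, b = set(p1), set(p2)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)


def dice_distance(p1, p2):
    """1 - 2|A ∩ B| / (|A| + |B|) on the n-gram sets, ignoring frequencies."""
    a, b = set(p1), set(p2)
    if not a and not b:
        return 0.0
    return 1 - 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    p = ngram_profile("the quick brown fox jumps over the lazy dog")
    q = ngram_profile("the quick brown cat sleeps under the lazy dog")
    print(cng_dissimilarity(p, q), jaccard_distance(p, q), dice_distance(p, q))
```

The set-based measures ignore how often each n-gram occurs, while the CNG-style measure weights each n-gram by its relative frequency difference. That contrast is one way to read the abstract's proposed comparison of these measures.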

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Keselj, Vlado

Combined mining of Web server logs and web contents for classifying user navigation patterns and predicting users' future requests
  • DOI:
    10.1016/j.datak.2006.06.001
  • Published:
    2007-05-01
  • Journal:
  • Impact factor:
    2.5
  • Authors:
    Liu, Haibin;Keselj, Vlado
  • Corresponding author:
    Keselj, Vlado
N-gram and Word2Vec Feature Engineering Approaches for Spam Recognition on Some Influential Twitter Topics in Saudi Arabia
Syllabification rules versus data-driven methods in a language with low syllabic complexity: The case of Italian
  • DOI:
    10.1016/j.csl.2009.02.004
  • Published:
    2009-10-01
  • Journal:
  • Impact factor:
    4.3
  • Authors:
    Adsett, Connie R.;Marchand, Yannick;Keselj, Vlado
  • Corresponding author:
    Keselj, Vlado

Other grants held by Keselj, Vlado

Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
  • Grant number:
    RGPIN-2019-05683
  • Fiscal year:
    2022
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
  • Grant number:
    RGPIN-2019-05683
  • Fiscal year:
    2021
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
  • Grant number:
    RGPIN-2019-05683
  • Fiscal year:
    2019
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
  • Grant number:
    262059-2013
  • Fiscal year:
    2018
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Social Media Analytics for Effective and Efficient Event Detection
  • Grant number:
    523332-2018
  • Fiscal year:
    2018
  • Amount:
    $20,400
  • Project category:
    Engage Grants Program
Combined N-gram and Semantic Approach to Assignment Feedback Analysis and Generation
  • Grant number:
    507291-2016
  • Fiscal year:
    2016
  • Amount:
    $20,400
  • Project category:
    Engage Grants Program
String-based and Unification-based Methodology for Text mining and Processing
  • Grant number:
    262059-2013
  • Fiscal year:
    2016
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Linking External Sources to Legal Contracts using Semantic Similarity
  • Grant number:
    490729-2015
  • Fiscal year:
    2015
  • Amount:
    $20,400
  • Project category:
    Engage Grants Program
String-based and Unification-based Methodology for Text mining and Processing
  • Grant number:
    262059-2013
  • Fiscal year:
    2015
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Developing NLP Capabilities for Quality and Risk Attribute Analysis of Requirements Specifications and Technical Documentation
  • Grant number:
    490728-2015
  • Fiscal year:
    2015
  • Amount:
    $20,400
  • Project category:
    Engage Grants Program

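Similar grants from the National Natural Science Foundation of China

The string method with stress and its applications in materials computation
  • Grant number:
    11001244
  • Approval year:
    2010
  • Amount:
    170,000 CNY
  • Project category:
    Young Scientists Fund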

Similar overseas grants

Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
  • Grant number:
    RGPIN-2019-05683
  • Fiscal year:
    2022
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
  • Grant number:
    RGPIN-2019-05683
  • Fiscal year:
    2021
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
  • Grant number:
    RGPIN-2019-05683
  • Fiscal year:
    2019
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
  • Grant number:
    262059-2013
  • Fiscal year:
    2018
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
  • Grant number:
    262059-2013
  • Fiscal year:
    2016
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Unification of the standard model and the Planck scale physics based on constructive formulation of string theory
  • Grant number:
    16K05322
  • Fiscal year:
    2016
  • Amount:
    $20,400
  • Project category:
    Grant-in-Aid for Scientific Research (C)
String-based and Unification-based Methodology for Text mining and Processing
  • Grant number:
    262059-2013
  • Fiscal year:
    2015
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
  • Grant number:
    262059-2013
  • Fiscal year:
    2014
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
  • Grant number:
    262059-2013
  • Fiscal year:
    2013
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual
Harmonized string-based and unification-based methodology for text mining and processing
  • Grant number:
    262059-2008
  • Fiscal year:
    2012
  • Amount:
    $20,400
  • Project category:
    Discovery Grants Program - Individual