Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
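Basic information
- Grant number: RGPIN-2019-05683
- Principal investigator:
- Amount: $20,400
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2022
- Funding country: Canada
- Period: 2022-01-01 to 2023-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract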
This research proposal aims to advance the state of the art in natural language processing at three levels in order to meet demands for better information processing. At the lowest level, character and word n-gram processing, our objectives are to improve n-gram based text mining through the use of variable-length n-gram profiles; to improve n-gram based visual text analytics through visualization of n-gram profiles and the corresponding Eulerian graphs; to compare the current CNG distance measure with other measures (e.g., Jaccard, Dice) at a deeper model level; to use Google N-grams data to improve the standard language n-gram profiles; and to adapt the Normalized Google Distance to obtain an off-line distance. At the middle level of processing (RegEx-based), we will advance the development of regular expression patterns for directed sentiment analysis and parsing of noisy text by examining ways to generate RegEx-based patterns, generating patterns from Google N-grams data, and extending the Starfish system for text-embedded processing. At the third level, the unification level, our objectives are to transfer the sub-graph isomorphism technique from analysis in the biomedical scientific domain to information gathering from social media, to generate semantic relationships between concepts from Wikipedia data, and to develop semantic-based visualization of streaming textual data, such as e-mail streams. Our approach is based on previous work at these three levels of language processing: (1) Common N-Gram analysis (CNG), where the text data is modelled using character n-gram profiles; (2) regular-expression-based processing of textual data, based on applying RegEx rewriting patterns and matching the data with similar patterns; and (3) at the unification level, information extraction and matching using unification or sub-graph isomorphism, where the structural data itself is generated by parsing with stochastic unification-based grammars.
The novelty and expected significance of the approach lie in improving the methodology to support visual text analysis, i.e., visualization and closer interaction with the user, and in better adapting and developing the methodology for new kinds of textual data and novel applications arising from the expansion of Internet data and social media. The significance of the approach is supported by strong interest from industrial partners in the area of summarized analysis of social media data.
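To make the lowest-level objectives concrete, the following is a minimal Python sketch of character n-gram profiles and of the three measures the abstract names (CNG, Jaccard, Dice). It is an illustration under stated assumptions, not the project's implementation: the function names, the default profile size, and the default n-gram length are hypothetical, and the CNG dissimilarity is written in the form commonly given in the CNG literature (a sum of squared normalized frequency differences) rather than taken from the abstract itself. Jaccard and Dice are applied here to the sets of n-grams in each profile, which is only one possible way to compare them with CNG.

```python
from collections import Counter


def ngram_profile(text, n=3, size=100):
    """Return the `size` most frequent character n-grams of `text`,
    mapped to their relative frequencies (hypothetical helper)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {g: c / total for g, c in counts.most_common(size)}


def cng_distance(p1, p2):
    """CNG-style dissimilarity (assumed form): sum over the union of both
    profiles of (2 * (f1 - f2) / (f1 + f2)) ** 2, with 0 for a missing n-gram."""
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one profile.
        total += (2.0 * (f1 - f2) / (f1 + f2)) ** 2
    return total


def jaccard_distance(p1, p2):
    """1 - |A & B| / |A | B| over the n-gram sets of the two profiles."""
    a, b = set(p1), set(p2)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dice_distance(p1, p2):
    """1 - 2|A & B| / (|A| + |B|) over the n-gram sets of the two profiles."""
    a, b = set(p1), set(p2)
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    doc_a = ngram_profile("the quick brown fox jumps over the lazy dog")
    doc_b = ngram_profile("the quick brown cat naps under the lazy dog")
    print("CNG:    ", cng_distance(doc_a, doc_b))
    print("Jaccard:", jaccard_distance(doc_a, doc_b))
    print("Dice:   ", dice_distance(doc_a, doc_b))
```

Unlike the set-based Jaccard and Dice measures, the CNG form above uses the frequencies themselves, so two profiles that share the same n-grams with different frequencies are still a nonzero distance apart.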
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Keselj, Vlado
Combined mining of Web server logs and web contents for classifying user navigation patterns and predicting users' future requests
- DOI: 10.1016/j.datak.2006.06.001
- Published: 2007-05-01
- Journal:
- Impact factor: 2.5
- Authors: Liu, Haibin; Keselj, Vlado
- Corresponding author: Keselj, Vlado
N-gram and Word2Vec Feature Engineering Approaches for Spam Recognition on Some Influential Twitter Topics in Saudi Arabia
- DOI: 10.12720/jait.13.6.562-568
- Published: 2022-12-01
- Journal:
- Impact factor: 1
- Authors: Balfagih, Ahmed M.; Keselj, Vlado; Taylor, Stacey
- Corresponding author: Taylor, Stacey
Syllabification rules versus data-driven methods in a language with low syllabic complexity: The case of Italian
- DOI: 10.1016/j.csl.2009.02.004
- Published: 2009-10-01
- Journal:
- Impact factor: 4.3
- Authors: Adsett, Connie R.; Marchand, Yannick; Keselj, Vlado
- Corresponding author: Keselj, Vlado
Other grants held by Keselj, Vlado
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
- Grant number: RGPIN-2019-05683
- Fiscal year: 2021
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
- Grant number: RGPIN-2019-05683
- Fiscal year: 2020
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
- Grant number: RGPIN-2019-05683
- Fiscal year: 2019
- Amount: $20,400
- Program: Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
- Grant number: 262059-2013
- Fiscal year: 2018
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Social Media Analytics for Effective and Efficient Event Detection
- Grant number: 523332-2018
- Fiscal year: 2018
- Amount: $20,400
- Program: Engage Grants Program
Combined N-gram and Semantic Approach to Assignment Feedback Analysis and Generation
- Grant number: 507291-2016
- Fiscal year: 2016
- Amount: $20,400
- Program: Engage Grants Program
String-based and Unification-based Methodology for Text mining and Processing
- Grant number: 262059-2013
- Fiscal year: 2016
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Linking External Sources to Legal Contracts using Semantic Similarity
- Grant number: 490729-2015
- Fiscal year: 2015
- Amount: $20,400
- Program: Engage Grants Program
String-based and Unification-based Methodology for Text mining and Processing
- Grant number: 262059-2013
- Fiscal year: 2015
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Developing NLP Capabilities for Quality and Risk Attribute Analysis of Requirements Specifications and Technical Documentation
- Grant number: 490728-2015
- Fiscal year: 2015
- Amount: $20,400
- Program: Engage Grants Program
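Similar grants from the National Natural Science Foundation of China (NSFC)
The string method with stress and its applications in materials computation
- Grant number: 11001244
- Approval year: 2010
- Amount: CNY 170,000
- Program: Young Scientists Fund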
Similar overseas grants
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
- Grant number: RGPIN-2019-05683
- Fiscal year: 2021
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
- Grant number: RGPIN-2019-05683
- Fiscal year: 2020
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Harmonizing String and Unification-based Methodology with Machine Learning for Text Mining and Processing
- Grant number: RGPIN-2019-05683
- Fiscal year: 2019
- Amount: $20,400
- Program: Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
- Grant number: 262059-2013
- Fiscal year: 2018
- Amount: $20,400
- Program: Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
- Grant number: 262059-2013
- Fiscal year: 2016
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Unification of the standard model and the Planck scale physics based on constructive formulation of string theory
- Grant number: 16K05322
- Fiscal year: 2016
- Amount: $20,400
- Program: Grant-in-Aid for Scientific Research (C)
String-based and Unification-based Methodology for Text mining and Processing
- Grant number: 262059-2013
- Fiscal year: 2015
- Amount: $20,400
- Program: Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
- Grant number: 262059-2013
- Fiscal year: 2014
- Amount: $20,400
- Program: Discovery Grants Program - Individual
String-based and Unification-based Methodology for Text mining and Processing
- Grant number: 262059-2013
- Fiscal year: 2013
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Harmonized string-based and unification-based methodology for text mining and processing
- Grant number: 262059-2008
- Fiscal year: 2012
- Amount: $20,400
- Program: Discovery Grants Program - Individual