A Document Processing System

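Basic information

  • Grant number:
    9160906
  • Principal investigator:
  • Amount:
    $439,700
  • Host institution:
  • Host institution country:
    United States
  • Project category:
  • Fiscal year:
  • Funding country:
    United States
  • Start and end dates:
  • Project status:
    Not concluded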

Project abstract

A system of C++ programs has been developed to find closely related documents in MEDLINE and to perform machine learning on sets of documents. The system has a number of unique features: 1) It is built from a number of C++ classes and is highly modular, so alterations to the system are relatively simple to make. 2) The system currently processes PubMed data by extracting it from the Sybase repositories through a C++ interface to Sybase. However, a change in the interface portion of the system would allow it to be applied to any large database consisting of discrete textual records. 3) Data processed by the system is stored as compressed file structures, etc. These structures are updatable, so new data can be added continually as it becomes available. 4) Documents are compared with each other using a Bayesian form of analysis. 5) The code has been multithreaded, and memory-mapping capabilities have been added to speed up processing. 6) Most recently, the code has been updated to work in a 64-bit environment.

The system described here is now used not only to process all of MEDLINE for our research purposes, but also by other groups inside and outside the NLM to produce related documents for arbitrary pieces of text. The system is currently proving useful in testing different retrieval parameters and methods on the PubMedHealth records.

We have recently developed a software system called DStor that allows us to store all of PubMed in a manner that is easily updatable and allows fast access. DStor is now being used to maintain and update five different versions of the PubMed data twice a week. It has greatly improved our access to PubMed data in various useful forms, and we anticipate that its use will continue to grow.

In addition, we have developed software to maintain and update a list of strings in which each string is associated with a fixed-length vector of integers.
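The string list with fixed-length integer vectors can be pictured with a minimal sketch. This is not the project's code: the class name StringCountTable, the in-memory std::map storage, and the add/find interface are all assumptions made for illustration, and the abstract does not say how the real list is stored or updated.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch: every string key owns a fixed-length vector of
// integer counts. The name StringCountTable and the width N are assumptions.
template <std::size_t N>
class StringCountTable {
public:
    using Counts = std::array<long, N>;

    // Add `delta` to position `slot` of `key`. A key seen for the first
    // time gets a zero-filled vector (operator[] value-initializes it).
    void add(const std::string& key, std::size_t slot, long delta = 1) {
        assert(slot < N);
        table_[key][slot] += delta;
    }

    // Return the counts stored for `key`, or nullptr if the key is absent.
    const Counts* find(const std::string& key) const {
        auto it = table_.find(key);
        return it == table_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return table_.size(); }

private:
    std::map<std::string, Counts> table_;  // ordered by key
};
```

An ordered map keeps the keys sorted, which makes merging in a batch of new strings straightforward. A production system at PubMed scale would more likely use an on-disk or memory-mapped structure, which this sketch does not attempt.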
We currently maintain a list of all multi-word phrases that contain no stop words or punctuation. Each phrase is associated with a vector of six integers holding counts of different types, computed over all PubMed records that have abstracts. We also maintain a list of all one- and two-word phrases and MeSH terms in various forms (with and without stars and subheadings). Each entry carries two counts: the document frequency and the total frequency, which counts all occurrences in each document over all of PubMed.
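The two counts kept for one- and two-word phrases can be illustrated with a small sketch. It assumes whitespace tokenization and in-memory strings as documents. The names TermCounts, tokenize, and count_terms are hypothetical, and MeSH terms, stop words, and punctuation handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record holding the two counts described in the abstract.
struct TermCounts {
    long doc_freq = 0;    // number of documents containing the term
    long total_freq = 0;  // occurrences summed over all documents
};

// Whitespace split; a stand-in for real tokenization (assumption).
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// Count every one- and two-word phrase across `docs`.
std::map<std::string, TermCounts> count_terms(const std::vector<std::string>& docs) {
    std::map<std::string, TermCounts> counts;
    for (const std::string& doc : docs) {
        const std::vector<std::string> words = tokenize(doc);
        std::set<std::string> seen;  // terms already credited to this document
        for (std::size_t i = 0; i < words.size(); ++i) {
            std::vector<std::string> terms{words[i]};
            if (i + 1 < words.size()) terms.push_back(words[i] + " " + words[i + 1]);
            for (const std::string& term : terms) {
                TermCounts& c = counts[term];
                ++c.total_freq;                             // every occurrence counts
                if (seen.insert(term).second) ++c.doc_freq; // once per document
                assert(c.doc_freq <= c.total_freq);
            }
        }
    }
    return counts;
}
```

The per-document set is what separates the two counts: total frequency grows with every occurrence, while document frequency grows only the first time a term appears in a given document.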

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Willy Wilbur

A Document Processing System
  • Grant number:
    8344939
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
Automatic Analysis and Annotation of Document Keywords in Biomedical Literature
  • Grant number:
    8344960
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
General and Semi-supervised Machine Learning Applied to Bioinformatics
  • Grant number:
    8558105
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
Natural Language Processing Techniques To Enhance Information Access.
  • Grant number:
    8943224
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
PubMed Query Log Analysis and Use in Access Enhancement
  • Grant number:
    7969244
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
Automatic Bayesian Methods In Text Retrieval
  • Grant number:
    8149591
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
A Document Processing System
  • Grant number:
    8149592
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
General and Semi-supervised Machine Learning Applied to Bioinformatics
  • Grant number:
    8149602
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
A Document Processing System
  • Grant number:
    7969199
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:
General and Semi-supervised Machine Learning Applied to Bioinformatics
  • Grant number:
    8344948
  • Fiscal year:
  • Funding amount:
    $439,700
  • Project category:

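Similar NSFC grants

Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
  • Grant number:
  • Approval year:
    2024
  • Funding amount:
  • Project category:
    Collaborative Innovation Research Team
Data-driven Recommendation System Construction of an Online Medical Platform Based on the Fusion of Information
  • Grant number:
  • Approval year:
    2024
  • Funding amount:
  • Project category:
    Research Fund for International Young Scientists
Development of a Linear Stochastic Model for Wind Field Reconstruction from Limited Measurement Data
  • Grant number:
  • Approval year:
    2020
  • Funding amount:
    CNY 400,000
  • Project category:
Key Technologies for Semantic Interoperability of Web Services Based on Linked Open Data
  • Grant number:
    61373035
  • Approval year:
    2013
  • Funding amount:
    CNY 770,000
  • Project category:
    General Program
Molecular Interaction Reconstruction of Rheumatoid Arthritis Therapies Using Clinical Data
  • Grant number:
    31070748
  • Approval year:
    2010
  • Funding amount:
    CNY 340,000
  • Project category:
    General Program
Functional Data Analysis Methods for High-Dimensional Data
  • Grant number:
    11001084
  • Approval year:
    2010
  • Funding amount:
    CNY 160,000
  • Project category:
    Young Scientists Fund
The Role of datA, a Negative Regulator of Chromosome Replication, in the Cell Cycle
  • Grant number:
    31060015
  • Approval year:
    2010
  • Funding amount:
    CNY 250,000
  • Project category:
    Regional Science Fund Project
Computational Methods for Analyzing Toponome Data
  • Grant number:
    60601030
  • Approval year:
    2006
  • Funding amount:
    CNY 170,000
  • Project category:
    Young Scientists Fund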

Similar overseas grants

FAIRification of multiOmics data to link databases and create knowledge graphs for fermented foods
  • Grant number:
    EP/Y032748/1
  • Fiscal year:
    2024
  • Funding amount:
    $439,700
  • Project category:
    Research Grant
Advancing the implementation of variant-level functional data into clinical databases and clinical practice
  • Grant number:
    10674373
  • Fiscal year:
    2023
  • Funding amount:
    $439,700
  • Project category:
Integrating third-party and open data with internal corporate databases
  • Grant number:
    542303-2019
  • Fiscal year:
    2022
  • Funding amount:
    $439,700
  • Project category:
    Collaborative Research and Development Grants
Using Graph Databases in a Computational Knowledge Engine to Make Animal Health and Disease Data More "FAIR"
  • Grant number:
    570288-2022
  • Fiscal year:
    2022
  • Funding amount:
    $439,700
  • Project category:
    Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Integrating third-party and open data with internal corporate databases
  • Grant number:
    542303-2019
  • Fiscal year:
    2021
  • Funding amount:
    $439,700
  • Project category:
    Collaborative Research and Development Grants
Data-driven CT image harmonization and hierarchical modeling in multi-institutional databases for musculoskeletal disease analysis
  • Grant number:
    21K18080
  • Fiscal year:
    2021
  • Funding amount:
    $439,700
  • Project category:
    Grant-in-Aid for Early-Career Scientists
NSF Convergence Accelerator Track D: Rapid Development of Intelligent, Built Environment Geo-Databases Using AI and Data-Driven Models
  • Grant number:
    2040735
  • Fiscal year:
    2020
  • Funding amount:
    $439,700
  • Project category:
    Standard Grant
The DEPRESsion Screening Data (DEPRESSD) Project: Continuous Updating of Databases for Individual Participant Data Meta-analyses of Depression Screening Test Accuracy
  • Grant number:
    438224
  • Fiscal year:
    2020
  • Funding amount:
    $439,700
  • Project category:
    Operating Grants
EarthCube Data Capabilities: Solutions for Paleobotany: a web client hosting novel content and its integration with existing databases
  • Grant number:
    2026961
  • Fiscal year:
    2020
  • Funding amount:
    $439,700
  • Project category:
    Standard Grant
Integrating third-party and open data with internal corporate databases
  • Grant number:
    542303-2019
  • Fiscal year:
    2020
  • Funding amount:
    $439,700
  • Project category:
    Collaborative Research and Development Grants