大语言模型(large language model,LLM)技术热潮对数据质量的要求提升到了一个新的高度.在现实场景中,数据通常来源不同且高度相关.但由于数据隐私安全问题,跨域异质数据往往不允许集中共享,难以被LLM高效利用.鉴于此,提出了一种LLM和...大语言模型(large language model,LLM)技术热潮对数据质量的要求提升到了一个新的高度.在现实场景中,数据通常来源不同且高度相关.但由于数据隐私安全问题,跨域异质数据往往不允许集中共享,难以被LLM高效利用.鉴于此,提出了一种LLM和知识图谱(knowledge graph,KG)协同的跨域异质数据查询框架,在LLM+KG的范式下给出跨域异质数据查询的一个治理方案.为确保LLM能够适应多场景中的跨域异质数据,首先采用适配器对跨域异质数据进行融合,并构建相应的知识图谱.为提高查询效率,引入线性知识图,并提出同源知识图抽取算法HKGE来实现知识图谱的重构,可显著提高查询性能,确保跨域异质数据治理的高效性.进而,为保证多域数据查询的高可信度,提出可信候选子图匹配算法Trust HKGM,用于检验跨域同源数据的置信度计算和可信候选子图匹配,剔除低质量节点.最后,提出基于线性知识图提示的多域数据查询算法MKLGP,实现LLM+KG范式下的高效可信跨域查询.该方法在多个真实数据集上进行了广泛实验,验证了所提方法的有效性和高效性.展开更多
查询推荐是一种帮助搜索引擎更好的理解用户检索需求的方法.基于查询的上下文片段训练词汇和查询之间的语义关系,同时结合查询和URL的点击图以及查询中的序列行为构建Term Query URL异构信息网络,采用重启动随机游走(Random Walk withR...查询推荐是一种帮助搜索引擎更好的理解用户检索需求的方法.基于查询的上下文片段训练词汇和查询之间的语义关系,同时结合查询和URL的点击图以及查询中的序列行为构建Term Query URL异构信息网络,采用重启动随机游走(Random Walk withRestart,RWR)进行查询推荐.综合利用语义信息和日志信息,提高了稀疏查询的推荐效果.基于概率语言模型构造查询的词汇向量,可以为新的查询进行查询推荐.在大规模商业搜索引擎查询日志上的实验表明本文方法相比传统的查询推荐方法性能提升约为3%~10%.展开更多
近年来,基于位置服务的技术迅猛发展,产生了海量的路网轨迹数据。而路径范围查询作为一种路网轨迹查询类型,是支持其他查询类型的基础。为了实现对海量路网轨迹数据的高效索引,同时提供精确的路径范围查询服务,提出了一种基于道格拉斯-...近年来,基于位置服务的技术迅猛发展,产生了海量的路网轨迹数据。而路径范围查询作为一种路网轨迹查询类型,是支持其他查询类型的基础。为了实现对海量路网轨迹数据的高效索引,同时提供精确的路径范围查询服务,提出了一种基于道格拉斯-普克算法的学习型索引结构(Douglas-Peuker Based Learned Index Structure,DPLI)。首先将轨迹数据分为多个轨迹段,然后取轨迹段中的点作为轨迹数据的表征,利用映射函数将其映射为一维映射值序列,而后根据键值数量将其划分为多个数据分片。在分片内将首尾数据组成一条线段,然后计算其余数据点距离线段的拟合误差,将超过误差阈值的数据点作为新的线段端点,递归分割原有的直线段,直到所有数据点的拟合误差小于阈值,从而拟合分段线性函数。采用多个路网数据和轨迹数据进行了充分的实验,实验结果表明:与传统索引方法相比,DPLI具有更快的构建效率和磁盘访问效率;与学习索引方法相比,DPLI保持了构建效率的优势,并且达到了100%查询召回率。展开更多
The idea of positional inverted index is exploited for indexing of graph database. The main idea is the use of hashing tables in order to prune a considerable portion of graph database that cannot contain the answer s...The idea of positional inverted index is exploited for indexing of graph database. The main idea is the use of hashing tables in order to prune a considerable portion of graph database that cannot contain the answer set. These tables are implemented using column-based techniques and are used to store graphs of database, frequent sub-graphs and the neighborhood of nodes. In order to exact checking of remaining graphs, the vertex invariant is used for isomorphism test which can be parallel implemented. The results of evaluation indicate that proposed method outperforms existing methods.展开更多
HashQuery,a Hash-area-based data dissemination protocol,was designed in wireless sensor networks. Using a Hash function which uses time as the key,both mobile sinks and sensors can determine the same Hash area. The se...HashQuery,a Hash-area-based data dissemination protocol,was designed in wireless sensor networks. Using a Hash function which uses time as the key,both mobile sinks and sensors can determine the same Hash area. The sensors can send the information about the events that they monitor to the Hash area and the mobile sinks need only to query that area instead of flooding among the whole network,and thus much energy can be saved. In addition,the location of the Hash area changes over time so as to balance the energy consumption in the whole network. Theoretical analysis shows that the proposed protocol can be energy-efficient and simulation studies further show that when there are 5 sources and 5 sinks in the network,it can save at least 50% energy compared with the existing two-tier data dissemination(TTDD) protocol,especially in large-scale wireless sensor networks.展开更多
文摘大语言模型(large language model,LLM)技术热潮对数据质量的要求提升到了一个新的高度.在现实场景中,数据通常来源不同且高度相关.但由于数据隐私安全问题,跨域异质数据往往不允许集中共享,难以被LLM高效利用.鉴于此,提出了一种LLM和知识图谱(knowledge graph,KG)协同的跨域异质数据查询框架,在LLM+KG的范式下给出跨域异质数据查询的一个治理方案.为确保LLM能够适应多场景中的跨域异质数据,首先采用适配器对跨域异质数据进行融合,并构建相应的知识图谱.为提高查询效率,引入线性知识图,并提出同源知识图抽取算法HKGE来实现知识图谱的重构,可显著提高查询性能,确保跨域异质数据治理的高效性.进而,为保证多域数据查询的高可信度,提出可信候选子图匹配算法Trust HKGM,用于检验跨域同源数据的置信度计算和可信候选子图匹配,剔除低质量节点.最后,提出基于线性知识图提示的多域数据查询算法MKLGP,实现LLM+KG范式下的高效可信跨域查询.该方法在多个真实数据集上进行了广泛实验,验证了所提方法的有效性和高效性.
文摘查询推荐是一种帮助搜索引擎更好的理解用户检索需求的方法.基于查询的上下文片段训练词汇和查询之间的语义关系,同时结合查询和URL的点击图以及查询中的序列行为构建Term Query URL异构信息网络,采用重启动随机游走(Random Walk withRestart,RWR)进行查询推荐.综合利用语义信息和日志信息,提高了稀疏查询的推荐效果.基于概率语言模型构造查询的词汇向量,可以为新的查询进行查询推荐.在大规模商业搜索引擎查询日志上的实验表明本文方法相比传统的查询推荐方法性能提升约为3%~10%.
文摘近年来,基于位置服务的技术迅猛发展,产生了海量的路网轨迹数据。而路径范围查询作为一种路网轨迹查询类型,是支持其他查询类型的基础。为了实现对海量路网轨迹数据的高效索引,同时提供精确的路径范围查询服务,提出了一种基于道格拉斯-普克算法的学习型索引结构(Douglas-Peuker Based Learned Index Structure,DPLI)。首先将轨迹数据分为多个轨迹段,然后取轨迹段中的点作为轨迹数据的表征,利用映射函数将其映射为一维映射值序列,而后根据键值数量将其划分为多个数据分片。在分片内将首尾数据组成一条线段,然后计算其余数据点距离线段的拟合误差,将超过误差阈值的数据点作为新的线段端点,递归分割原有的直线段,直到所有数据点的拟合误差小于阈值,从而拟合分段线性函数。采用多个路网数据和轨迹数据进行了充分的实验,实验结果表明:与传统索引方法相比,DPLI具有更快的构建效率和磁盘访问效率;与学习索引方法相比,DPLI保持了构建效率的优势,并且达到了100%查询召回率。
文摘The idea of positional inverted index is exploited for indexing of graph database. The main idea is the use of hashing tables in order to prune a considerable portion of graph database that cannot contain the answer set. These tables are implemented using column-based techniques and are used to store graphs of database, frequent sub-graphs and the neighborhood of nodes. In order to exact checking of remaining graphs, the vertex invariant is used for isomorphism test which can be parallel implemented. The results of evaluation indicate that proposed method outperforms existing methods.
基金Project(07JJ1010) supported by Hunan Provincial Natural Science Foundation of ChinaProjects(2006AA01Z202, 2006AA01Z199) supported by the National High-Tech Research and Development Program of China+2 种基金Project(7002102) supported by the City University of Hong Kong, Strategic Research Grant (SRG)Project(IRT-0661) supported by the Program for Changjiang Scholars and Innovative Research Team in UniversityProject(NCET-06-0686) supported by the Program for New Century Excellent Talents in University
文摘HashQuery,a Hash-area-based data dissemination protocol,was designed in wireless sensor networks. Using a Hash function which uses time as the key,both mobile sinks and sensors can determine the same Hash area. The sensors can send the information about the events that they monitor to the Hash area and the mobile sinks need only to query that area instead of flooding among the whole network,and thus much energy can be saved. In addition,the location of the Hash area changes over time so as to balance the energy consumption in the whole network. Theoretical analysis shows that the proposed protocol can be energy-efficient and simulation studies further show that when there are 5 sources and 5 sinks in the network,it can save at least 50% energy compared with the existing two-tier data dissemination(TTDD) protocol,especially in large-scale wireless sensor networks.