Abstract
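To address the “curse of dimensionality” and the I/O performance bottleneck that approximate nearest neighbor (ANN) search over massive high-dimensional data faces in external-memory environments, this paper proposes O2LSH (Optimal Order LSH), a locality-sensitive hashing (LSH) index scheme based on optimal ordering. A space-filling curve is introduced to establish a linear order over the compound hash keys, and the keys are sorted by it, so that more near-neighbor candidates fall on the same or adjacent disk pages and a small number of sequential I/Os can load enough candidates. The paper quantitatively analyzes several commonly used space-filling curves and finds that: (1) the row-wise curve used by the basic sorting scheme SK-LSH has a “dimension-first traversal” characteristic, which tends to impose several limitations on ANN queries; (2) another class of curves, with a “neighborhood-first traversal” characteristic, yields a better local distribution of candidates and more stable ordering performance. Through comparison, we select an optimal “neighborhood-first traversal” curve to construct the linear order, which improves the local distribution of near-neighbor candidates to the greatest extent and further raises disk access efficiency and query accuracy. Comparative experiments on several real multimedia data sets confirm the superiority of O2LSH over state-of-the-art LSH schemes (including C2LSH, SK-LSH, SRS and QALSH) in query accuracy and I/O efficiency. In particular, O2LSH overcomes the sensitivity of the basic sorting scheme SK-LSH to a key LSH parameter, which further improves the practicality of the algorithm.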
Nearest neighbor (NN) search in high-dimensional space is a fundamental paradigm in a wide range of applications, such as text information retrieval, search engines, content-based information query, and duplicate detection. In these areas, data are usually large-scale and modeled as high-dimensional features, which introduces two major problems for NN search: the “curse of dimensionality” and the I/O performance bottleneck. On the one hand, the curse of dimensionality makes exact NN search nearly infeasible. Much research effort has therefore been devoted to finding approximate nearest neighbors (ANN) that are close enough to the query to achieve a satisfying trade-off between accuracy and efficiency. On the other hand, large-scale feature sets are often too massive to fit into internal memory, so external storage (usually disk) becomes a reasonable choice. However, because of the huge speed gap between internal and external memory, the resulting input/output communication is very expensive: too many NN candidates, or an improper way of loading them, makes candidate loading the most time-consuming part of the entire NN search. This is called the I/O performance bottleneck. Locality-sensitive hashing (LSH) is a widely adopted technique because of its excellent error guarantee and high computational efficiency in tackling the “curse of dimensionality”. It enables fast and accurate filtering of irrelevant points and offers c-ANN results in sub-linear time, which also makes it an attractive approach for disk-based ANN search. Plenty of LSH-based methods have been developed to further boost ANN performance. However, most of them access candidate objects through a large number of random I/O operations, which makes them prone to the I/O performance bottleneck. In this work, a novel LSH index called Optimal Order LSH (O2LSH) is designed to address the above two problems. First, O2LSH introduces space-filling curves to sort the compound LSH keys and rearranges the original data points accordingly. In this way, NN candidates are stored on the same or adjacent disk pages, so that only a few sequential I/O operations are needed to load enough candidates during the search. A thorough quantitative analysis is then conducted on several common space-filling curves. The results show that these curves fall into two kinds: (1) curves with a “dimension-first-traverse” characteristic (such as the row-wise curve) tend to introduce several limitations to ANN search; (2) curves with a “neighborhood-first-traverse” characteristic (such as the Z-order, Gray, and Hilbert curves) produce a better local distribution of NN candidate points, and their performance is more stable. Based on this analysis, O2LSH chooses the best “neighborhood-first-traverse” curve as the linear order, so as to keep as many NN candidates as possible within the same local disk pages. In this way, O2LSH not only enhances I/O efficiency but also improves ANN search accuracy. We conduct empirical experiments on 6 real-world multimedia data sets. The results demonstrate the superior accuracy and I/O efficiency of O2LSH in ANN search compared with 4 state-of-the-art methods: C2LSH, SK-LSH, SRS and QALSH. In particular, unlike the basic sorting-based solution SK-LSH, O2LSH is no longer sensitive to a key parameter of the LSH function, which further improves the practicality of the algorithm.
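The contrast between the two curve families can be illustrated with a small sketch. The Python below is not the authors' implementation. It assumes that compound hash keys are tuples of non-negative integers that fit in `bits` bits, and it uses the Z-order (Morton) curve purely as one example of the neighborhood-first curves named above, not necessarily the curve O2LSH selects. The names `morton_key`, `row_wise_order`, `z_order`, and `pages` are hypothetical.

```python
from typing import List, Sequence, Tuple

Key = Tuple[int, ...]


def morton_key(key: Sequence[int], bits: int) -> int:
    """Interleave the bits of every coordinate, most significant bit first."""
    code = 0
    for b in range(bits - 1, -1, -1):
        for v in key:
            # Assumption: 0 <= v < 2**bits; real LSH values may need shifting first.
            code = (code << 1) | ((v >> b) & 1)
    return code


def row_wise_order(keys: Sequence[Key]) -> List[int]:
    """Dimension-first order: plain lexicographic sort of the compound keys."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


def z_order(keys: Sequence[Key], bits: int) -> List[int]:
    """Neighborhood-first order: sort by the interleaved (Morton) code."""
    return sorted(range(len(keys)), key=lambda i: morton_key(keys[i], bits))


def pages(order: Sequence[int], page_size: int) -> List[List[int]]:
    """Cut a linear order into consecutive 'disk pages' of page_size entries."""
    return [list(order[i:i + page_size]) for i in range(0, len(order), page_size)]


if __name__ == "__main__":
    # Toy data: every key of a 4 x 4 grid of two-component compound keys.
    grid = [(x, y) for x in range(4) for y in range(4)]
    orders = (("row-wise", row_wise_order(grid)), ("z-order", z_order(grid, bits=2)))
    for name, order in orders:
        first_page = [grid[i] for i in pages(order, page_size=4)[0]]
        print(name, first_page)
```

On this toy 4×4 grid with four keys per page, the first row-wise page holds the single row (0,0) to (0,3), whereas the first Z-order page holds the 2×2 block (0,0), (0,1), (1,0), (1,1). Keys that are close in every component therefore share a page, which is the effect the abstract attributes to neighborhood-first curves.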
Authors
冯小康
彭延国
崔江涛
刘英帆
李辉
FENG Xiao-Kang; PENG Yan-Guo; CUI Jiang-Tao; LIU Ying-Fan; LI Hui (School of Computer Science and Technology, Xidian University, Xi’an 710071; Department of System Engineering and Engineering Management, Chinese University of Hong Kong, Hong Kong, China 999077; School of Cyber Engineering, Xidian University, Xi’an 710071; National Engineering Laboratory for Public Security Risk Perception and Control by Big Data, Beijing 100084)
Source
《计算机学报》
EI
CSCD
Peking University Core Journal (北大核心)
2020, Issue 5, pp. 930-947 (18 pages)
Chinese Journal of Computers
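Funding
National Natural Science Foundation of China (61976168, 61702403, 61672408, 61972309)
National 111 Project (B16037)
Director's Fund of the National Engineering Laboratory for Public Security Risk Perception and Control by Big Data
CCF-Huawei Database Innovation Research Program (CCFHuaweiDBIR008B)
China Postdoctoral Science Foundation (2018M633473)
Natural Science Basic Research Plan of Shaanxi Province (2015JQ6227, 2018JM6073, 2019CGXNG-023)
Key Research and Development Program of Jiangxi Province (20181ACE50029)
Fundamental Research Funds for the Central Universities (XJS190305, JB181505)
Graduate Innovation Fund of Xidian University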
Keywords
approximate nearest neighbor (近似最近邻)
high-dimensional index (高维索引)
locality-sensitive hashing (局部敏感哈希)
linear order (空间线序)
local distribution (局部分布)
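Author biographies
FENG Xiao-Kang (冯小康), Ph.D. candidate, member of the China Computer Federation (CCF). Main research interests: high-dimensional indexing and approximate query processing. E-mail: research@xkfeng.com
PENG Yan-Guo (彭延国), Ph.D., lecturer, senior member of CCF. Main research interests: secure query, privacy protection, and cloud computing security.
Corresponding author: CUI Jiang-Tao (崔江涛), Ph.D., professor, council member and distinguished member of CCF. Main research areas: data and knowledge engineering, large-scale data management, and data security and privacy protection. E-mail: cuijt@xidian.edu.cn
LIU Ying-Fan (刘英帆), Ph.D. candidate, member of CCF. Main research interests: large-scale complex data management and similarity search over high-dimensional data.
LI Hui (李辉), Ph.D., professor, senior member of CCF. Main research areas: social computing, knowledge discovery, graph mining, and privacy protection in big data.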