Publications

Scaling Learned Metric Index to 100M Datasets

Learned indexing of high-dimensional data is still in the process of proving its viability, and the Learned Metric Index (LMI) is one of the pioneering methods in this area. The earlier implementation of LMI [Slanináková et al., SISAP 2023] served primarily as an experimental prototype and operated under unrealistic assumptions, such as unlimited main memory or unbounded index construction time. Recently, however, LMI made the leap towards practical applicability on real-world datasets when it was successfully deployed to index 214 million protein structures for near-instantaneous retrieval [Procházka et al., Nucleic Acids Research 2024]. This paper details the key improvements that enabled this transition: parallel query processing (with the possibility of GPU acceleration), adaptive memory usage, pre-construction of memory buckets for contiguous access, a shift from k-means to spherical k-means clustering, and faster index construction through fewer training epochs and smaller training samples. LMI can now handle datasets on the scale of 100M objects and supports both in-memory and on-disk indexing, marking several important steps towards the practical viability of AI-enhanced indexes for high-dimensional complex data in real-world settings.
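Two of the improvements listed above, spherical k-means and pre-built buckets for contiguous access, can be illustrated with a short sketch. This is not the LMI implementation: the function names (`spherical_kmeans`, `build_buckets`), the parameters, and the toy data are hypothetical, and pure Python is used only to keep the example self-contained. It assumes that spherical k-means means clustering unit-normalized vectors by cosine similarity with centroids re-projected onto the unit sphere, and that a bucket is a contiguous slice of object ids sharing a cluster label.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its most similar centroid."""
    k = len(centroids)
    return [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]


def spherical_kmeans(data, k, n_iter=10, seed=0):
    """Hypothetical helper: k-means on the unit sphere using cosine similarity."""
    rng = random.Random(seed)
    points = [normalize(p) for p in data]
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                summed = [sum(column) for column in zip(*members)]
                # Re-normalizing the sum keeps the centroid on the unit sphere.
                centroids[c] = normalize(summed)
    return centroids, assign(points, centroids)


def build_buckets(labels, k):
    """Hypothetical helper: order object ids so each cluster is one slice.

    Bucket c is order[offsets[c]:offsets[c + 1]].
    """
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    offsets = [0] * (k + 1)
    for label in labels:
        offsets[label + 1] += 1
    for c in range(k):
        offsets[c + 1] += offsets[c]
    return order, offsets


if __name__ == "__main__":
    toy = [[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]]
    centroids, labels = spherical_kmeans(toy, k=2)
    order, offsets = build_buckets(labels, k=2)
    for c in range(2):
        print("bucket", c, "->", order[offsets[c]:offsets[c + 1]])
```

Because the vectors are normalized, magnitude is ignored and only direction decides cluster membership. Storing each bucket as one slice means that visiting a bucket reads a single contiguous range instead of scattered ids.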

AlphaFind: discover structure similarity across the proteome in AlphaFold DB

AlphaFind is a web-based search engine that provides fast structure-based retrieval across the entire set of AlphaFold DB structures. Unlike other protein processing tools, AlphaFind focuses entirely on tertiary structure: it automatically extracts the main 3D features of each protein chain and uses a machine learning model to find the most similar structures. Both the indexing approach and the 3D feature extraction method used by AlphaFind have demonstrated remarkable scalability to large datasets as well as to large protein structures. The web application itself has been designed with a focus on clarity and ease of use. The search accepts any valid UniProt ID, Protein Data Bank ID or gene symbol as input, and returns a set of similar protein chains from AlphaFold DB, together with various similarity metrics between the query and each retrieved result. In addition to the main search functionality, the application provides 3D visualizations of protein structure superpositions so that researchers can instantly analyze the structural similarity of the retrieved results. The AlphaFind web application is available online for free and without any registration at https://alphafind.fi.muni.cz.

SISAP 2023 Indexing Challenge – Learned Metric Index

This submission to the SISAP Indexing Challenge examines the experimental setup and performance of the Learned Metric Index, which uses an architecture of interconnected learned models to answer similarity queries. This design inherently allows a great deal of flexibility in the implementation, such as the choice of particular machine learning models or their arrangement in the overall architecture of the index. Therefore, for the sake of transparency and reproducibility, this report thoroughly describes the details of the specific Learned Metric Index implementation used to tackle the challenge.

Organizing Similarity Spaces Using Metric Hulls

The novel concept of a metric hull has recently been introduced to encompass a set of objects by a few selected border objects. Building on one of the metric-hull computation methods that generates a hierarchy of metric hulls, we introduce a metric index structure for unstructured and complex data, the Metric Hull Tree (MH-tree). We propose constructing the MH-tree by a bulk-loading procedure and outline an insert operation. Based on the design of the tree, we provide an implementation of an approximate kNN search operation. Finally, we use the Profimedia dataset to evaluate various building and ranking strategies of the MH-tree and compare the results with the M-tree.