A Bi-metric Framework for Fast Similarity Search
Haike Xu*, Sandeep Silwal, Piotr Indyk
Preprint [pdf] [story] [video]
We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a *ground-truth* metric that is accurate but expensive to compute (e.g., embeddings from a large or expensive neural network to compare two sentences), and a *proxy* metric that is cheaper but less accurate (e.g., embeddings from a very small or local network). In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only making a limited number of calls to both metrics. In many cases, we can beat the popular "re-ranking" baseline through the use of some pretty cool graph-based data structures!
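To make the setup concrete, here is a minimal sketch of the core idea (not the paper's exact algorithm or guarantees): build a neighbor graph once using only the cheap proxy metric, then at query time run a greedy graph search that scores candidates with the expensive metric under a fixed budget of expensive calls. The `graph`, `expensive_dist`, and `budget` arguments are illustrative placeholders.

```python
import heapq

def bimetric_greedy_search(graph, start, query, expensive_dist, budget):
    """Greedy best-first search over a neighbor graph built with the cheap
    proxy metric, scoring nodes with the expensive metric at query time.
    graph[u] is the list of u's neighbors; budget caps expensive-metric calls.
    A minimal sketch of the idea, not the paper's exact algorithm."""
    visited = {start}
    calls = 1
    frontier = [(expensive_dist(query, start), start)]
    best_dist, best_node = frontier[0]
    while frontier:
        d, u = heapq.heappop(frontier)
        if d < best_dist:
            best_dist, best_node = d, u
        for v in graph[u]:
            if v in visited or calls >= budget:
                continue
            visited.add(v)
            calls += 1
            heapq.heappush(frontier, (expensive_dist(query, v), v))
    return best_node, best_dist
```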
Efficiently Computing Similarities to Private Datasets
Arturs Backurs, Zinan Lin, Sepideh Mahabadi, Sandeep Silwal, Jakub Tarnawski
ICLR 2024 [pdf] [story] [video]
Imagine a hospital which has a (private) dataset of patients with a particular ailment. We would like to measure the similarity between this dataset and any new patient in hopes of diagnosing them. The paper abstracts this natural problem as follows: given a dataset $X \subset \mathbb{R}^d$, output a private data structure $D$. On any query vector $y$, $D(y)$ computes $\sum_{x \in X} f(x,y)$ where $f$ is a desired similarity function (e.g. a distance function or kernel function). We want $D$ itself to be private so it can handle an arbitrary number of queries (for example, imagine if $D$ always outputs $42$: this is private but not very accurate, so we want to do better). The paper explores several natural choices of $f$ and shows how to exploit the "geometry" of $f$ to create $D$ with low error.
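For intuition, here is the textbook baseline for privatizing a *single* query (not the paper's construction): add Laplace noise calibrated to the sensitivity of the sum. The whole point of the paper is to instead release a data structure $D$ that can answer arbitrarily many queries with low error, which this per-query approach does not give you.

```python
import numpy as np

def private_similarity_sum(X, y, f, eps, sensitivity):
    """Laplace-mechanism baseline for one query (illustrative only).
    Answers sum_{x in X} f(x, y) with noise calibrated to the sensitivity of f
    (e.g. sensitivity 1 if f takes values in [0, 1]); this gives eps-DP for a
    single answer, whereas the paper releases a private data structure D."""
    true_sum = sum(f(x, y) for x in X)
    return true_sum + np.random.laplace(scale=sensitivity / eps)
```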
Improved Frequency Estimation Algorithms with and without Predictions
Anders Aamand, Justin Y. Chen, Huy Nguyen, Sandeep Silwal, Ali Vakilian
Spotlight Presentation at NeurIPS 2023 [pdf] [story] [video]
We consider the fundamental problem of estimating frequencies of items arriving in a stream, given access to an oracle which predicts "heavy-hitters". Prior work gave theoretical guarantees for this setting under the further (natural) assumption that the underlying frequencies follow a Zipfian pattern. We show that under the Zipfian assumption, a much better frequency estimation algorithm already exists without the use of any predictions. Additionally incorporating predictions further improves its performance, as shown in our experiments.
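As a point of reference, here is a hedged sketch in the spirit of learned frequency estimation (simplified, and not the specific algorithms analyzed in the paper): a Count-Min sketch that stores items flagged as heavy by the predictor in exact counters and sketches everything else. The `is_heavy` oracle and the `width`/`depth` parameters are illustrative.

```python
import collections
import random

class LearnedCountMin:
    """Count-Min sketch with a heavy-hitter predictor: predicted heavy items
    get exact counters, predicted light items go into the sketch.
    A simplified sketch of the learned-sketching idea, not the paper's algorithms."""

    def __init__(self, width, depth, is_heavy, seed=0):
        self.w, self.d = width, depth
        self.is_heavy = is_heavy                # the heavy-hitter oracle
        self.exact = collections.Counter()      # exact counts for predicted heavy items
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, x, c=1):
        if self.is_heavy(x):
            self.exact[x] += c
        else:
            for row, salt in enumerate(self.salts):
                self.table[row][hash((salt, x)) % self.w] += c

    def estimate(self, x):
        if x in self.exact:
            return self.exact[x]
        # Count-Min estimate: minimum over rows (an overestimate in expectation)
        return min(self.table[row][hash((salt, x)) % self.w]
                   for row, salt in enumerate(self.salts))
```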
A Near-Linear Time Algorithm for the Chamfer Distance
Ainesh Bakshi, Piotr Indyk, Rajesh Jayaram, Sandeep Silwal, Erik Waingarten
NeurIPS 2023 [pdf] [story] [video]
A popular measure of similarity between two large-scale, high-dimensional datasets is the optimal transport (or Earth Mover's) distance. Unfortunately, this is a computationally expensive measure to compute. In this paper, we consider fast algorithms for computing the Chamfer distance, a relaxation of optimal transport. It is defined as the sum of nearest neighbor distances from every point in one dataset to the other. We give a near-linear time algorithm for approximating this value and show that, interestingly, outputting the underlying mapping between points that defines the Chamfer distance (as opposed to just its value) conditionally requires quadratic time.
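The definition is simple enough to state in a few lines of code. Below is an exact computation using a k-d tree, which is fine in low dimensions and purely illustrative; the point of the paper is a near-linear time approximation that also works in high dimensions, where this kind of exact approach breaks down.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer(A, B):
    """Chamfer distance from A to B: sum over a in A of min_{b in B} ||a - b||.
    Exact computation via a k-d tree (illustrative; slow in high dimensions).
    A and B are (n x d) and (m x d) numpy arrays of points."""
    dists, _ = cKDTree(B).query(A, k=1)
    return dists.sum()
```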
Data Structures for Density Estimation
Anders Aamand, Alexandr Andoni, Justin Y Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal
ICML 2023 [pdf] [story] [slides] [video]
We consider the following problem: we are given knowledge of $k$ discrete distributions $v_i$ for $1 \le i \le k$ over the domain $[n] = \{1, \ldots, n\}$ which we can preprocess. Then we get samples from an unknown discrete distribution $p$, also over $[n]$. The goal is to output the closest distribution to $p$ among the $v_i$'s in TV distance (up to some small additive error). The state-of-the-art algorithms require $\Theta(\log k)$ samples and run in near-linear time. We introduce a fresh perspective on the problem and ask if we can output the closest distribution in sublinear time. This question is particularly motivated as the problem is a generalization of the traditional nearest neighbor search problem: if we take enough samples, we can learn $p$ explicitly up to low TV distance and then find the closest $v_i$ in $o(k)$ time using standard nearest neighbor search algorithms. However, this approach requires $\Omega(n)$ samples. Thus, it is natural to ask: can we obtain both a sublinear number of samples and sublinear query time? We present some nice progress towards this question.
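For context, here is the naive baseline described above (learn $p$ empirically, then scan over the hypotheses), which needs $\Omega(n)$ samples and linear-in-$k$ query time; the paper's goal is to beat both. The array shapes below are just for illustration.

```python
import numpy as np

def closest_distribution(samples, V, n):
    """Naive baseline: build the empirical distribution of the samples and
    return the hypothesis v_i minimizing TV distance to it.
    samples: iterable of integers in [0, n); V: (k x n) array of hypothesis pmfs.
    Needs Omega(n) samples to learn p and Omega(k*n) time per query."""
    emp = np.bincount(np.asarray(samples), minlength=n) / len(samples)
    tv = 0.5 * np.abs(V - emp).sum(axis=1)   # TV distance to each v_i
    return int(np.argmin(tv))
```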
KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals
Sandeep Silwal*, Sara Ahmadian, Andrew Nystrom, Andrew McCallum, Deepak Ramachandran, Seyed Mehran Kazemi
ICLR 2023 [pdf] [story] [slides]
This work explores whether we can efficiently query large ML models, more precisely large language models, to solve concrete downstream tasks. We focus on clustering: we are given a set of $n$ sentences or text documents and query access to an LLM which provides high-quality pairwise similarity scores for every pair. Ideally, we would query the model on all $\Theta(n^2)$ pairs to obtain a high-quality similarity matrix, which can be used for many types of clustering algorithms downstream. However, this is not feasible in practice due to the prohibitive resource costs required to make a large number of queries. For example, even if $n = 10^6$, finding the 'ground truth' similarities as computed by the model requires on the order of $10^{12}$ queries! We solve this challenging problem via an algorithmic lens. First, we narrow down to the concrete clustering problem of correlation clustering. Then, we use the fact that, oftentimes, very efficient but lower-quality hints/predictions are available for text similarity. Using the idiosyncrasies of correlation clustering coupled with readily available hints, we are able to query the model a very limited number of times while still obtaining a clustering comparable to the hypothetical case where querying all pairs is feasible. We also theoretically model the setting and show that for correlation clustering, a linear number of queries is sufficient to obtain close to the optimal clustering quality.
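To give a feel for the cheap-weak / expensive-strong interplay, here is a highly simplified sketch (not the actual KwikBucks algorithm): each item uses the cheap weak signal to shortlist a few candidate cluster centers, and the expensive strong signal is queried only on that shortlist. The `weak_sim`, `strong_sim`, `k`, and `tau` arguments are illustrative placeholders.

```python
def cheap_then_expensive_clustering(items, weak_sim, strong_sim, k, tau):
    """Simplified sketch of the cheap-weak / expensive-strong interplay
    (not the full KwikBucks algorithm): rank existing cluster centers by the
    weak signal, then spend expensive strong-signal queries only on the top-k
    candidates; tau is the strong-signal threshold for joining a cluster."""
    centers, clusters = [], {}
    for x in items:
        # cheap pass: shortlist current centers using the weak signal
        candidates = sorted(centers, key=lambda c: weak_sim(x, c), reverse=True)[:k]
        # expensive pass: query the strong signal only on the shortlist
        home = next((c for c in candidates if strong_sim(x, c) >= tau), None)
        if home is None:          # no good match: x becomes a new center
            centers.append(x)
            home = x
        clusters.setdefault(home, []).append(x)
    return clusters
```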
Faster Linear Algebra for Distance Matrices
Piotr Indyk, Sandeep Silwal
Oral Presentation at NeurIPS 2022 [pdf] [story] [slides] [video]
A distance matrix is defined as the matrix $A$ which records pairwise distances between $n$ data points in $\mathbb{R}^d$. They are ubiquitous in ML and data science as they capture similarity matrices and have applications in non-linear dimensionality reduction, visualization, clustering, etc. However, they require $\Omega(n^2)$ time and space for a dataset with $n$ points, making them quite difficult to compute on. We fill this gap and explore faster and scalable algorithms for distance matrices. There are three gems in the paper. (1) For the $\ell_1$ distance metric, we can compute $Ay$ for any vector $y$ in $O(nd)$ time, which is sublinear in the matrix size! Many foundational linear algebra algorithms (power method, iterative methods, etc.) use matrix-vector products as the fundamental unit, so $Ay$ is the 'right' algorithmic primitive to consider. (2) The natural follow-up question to (1) is whether all metrics admit such fast matrix-vector speedups. This is not the case! We show that for the $\ell_{\infty}$ metric, computing $Ay$ for a general $y$ requires $\Omega(n^2)$ time under standard hardness assumptions. (3) Lastly, we consider the question of computing an approximate distance matrix, for example for the $\ell_2$ metric. The most natural way to do this, if one is familiar with some algorithmic tools, is to use the Johnson-Lindenstrauss (JL) lemma: simply project your points onto $O(\log n)$ dimensions and compute the distance matrix in the projected space, which requires $\Theta(n^2 \log n)$ time. Computing distances is in fact one of the most common usages of the JL lemma. We show that one can actually go beyond the JL lemma and compute an approximate distance matrix in $O(n^2 \text{poly}(\log \log n))$ time! This is the result one would get if we could project onto $O(\text{poly}(\log \log n))$ dimensions using the JL lemma, but this is false as the JL lemma is tight! We use some nice metric compression results to obtain this result.
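Gem (1) is easy to play with. Here is a small sketch of the per-coordinate sort-and-prefix-sum idea behind the fast $\ell_1$ matrix-vector product; as written it takes $O(nd \log n)$ time because of the sorting, but the sorts depend only on the points and can be precomputed once, after which each product is $O(nd)$. It never forms the $n \times n$ matrix.

```python
import numpy as np

def l1_distance_matvec(X, y):
    """Compute A @ y where A[i, j] = ||X[i] - X[j]||_1, without forming A.
    Per coordinate k: sort the values, then use prefix sums of y and y*x to
    evaluate sum_j y_j * |x_i[k] - x_j[k]| for all i at once.
    X: (n x d) array of points, y: length-n vector."""
    n, d = X.shape
    out = np.zeros(n)
    for k in range(d):
        a = X[:, k]
        order = np.argsort(a)                  # could be precomputed once per coordinate
        a_s, y_s = a[order], y[order]
        cum_y = np.cumsum(y_s)                 # sum of y_j with a_j <= a_i
        cum_ay = np.cumsum(y_s * a_s)          # sum of y_j * a_j with a_j <= a_i
        tot_y, tot_ay = cum_y[-1], cum_ay[-1]
        # contribution of coordinate k to (A y)_i, in sorted order
        s = a_s * cum_y - cum_ay + (tot_ay - cum_ay) - a_s * (tot_y - cum_y)
        out[order] += s
    return out
```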
Faster Fundamental Graph Algorithms via Learned Predictions
Justin Y. Chen, Sandeep Silwal, Ali Vakilian, Fred Zhang
ICML 2022 [pdf] [story] [video]
We extend the work of Dinitz et al. (NeurIPS 2021) and study graph algorithms with the help of learned LP variables. We obtain a better bound for bipartite matching by taking the Hungarian-method-inspired algorithm of Dinitz et al. and incorporating a flow perspective. We also give a learned variant of Goldberg's classic algorithm for shortest paths with negative edge weights. Interestingly, the vertex potentials that are often used in shortest-path algorithms are also the variables of an appropriate LP. Lastly, we obtain a host of other learning-based results by utilizing standard (efficient) reductions between graph problems. An interesting development was that for the matching case, the challenge was to best utilize a (feasible) set of dual predictions, whereas for shortest paths the challenge was to first round a given set of predictions to be feasible. After that, optimizing is easy: just run Dijkstra's! Our experiments were also on an interesting dataset: graphs where nodes represent countries and edges are currency exchange rates. By taking the logs of the rates, finding the most efficient way to transfer one currency to another is a cool textbook example of shortest paths in action.
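For the curious, here is the textbook trick behind that experiment: convert each exchange rate to the edge weight $-\log(\text{rate})$, so that maximizing the product of rates along a path becomes a shortest-path problem with (possibly negative) edge weights, solvable by Bellman-Ford. This is just the setup of the dataset, not our learned algorithm, and it ignores negative cycles (i.e., arbitrage).

```python
import math

def best_conversion_rates(rates, source):
    """Bellman-Ford on edge weights -log(rate): minimizing the sum of -log(rate)
    along a path maximizes the product of exchange rates.
    rates: dict mapping (u, v) -> exchange rate from currency u to currency v.
    Returns the best achievable rate from `source` to every reachable currency
    (assuming no negative cycles, i.e. no arbitrage)."""
    nodes = {u for u, _ in rates} | {v for _, v in rates}
    dist = {u: math.inf for u in nodes}
    dist[source] = 0.0
    for _ in range(len(nodes) - 1):
        for (u, v), r in rates.items():
            w = -math.log(r)
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return {v: math.exp(-d) for v, d in dist.items() if d < math.inf}
```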
Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering
Shyam Narayanan*, Sandeep Silwal*, Piotr Indyk, Or Zamir
ICML 2021 [pdf] [story] [video]
We obtain dimensionality reduction bounds for facility location clustering and the minimum spanning tree problem which go beyond the classical Johnson-Lindenstrauss (JL) lemma. For context, the work of Makarychev et al. (STOC 2019) showed that for $k$-means clustering, one can obtain a dimension bound in terms of $\log k$, rather than $\log n$ where $n$ is the number of points. In the problems we consider, there is no parameter "$k$" to parameterize the dimension bound. Instead, we use the doubling dimension, which is a measure of the intrinsic dimensionality of the dataset: it is at most $O(\log n)$ in the worst case (i.e., the JL bound) but can also be $O(1)$ regardless of $n$. To obtain our bounds, we exploit 'local' properties of our problems inspired by existing algorithms. The paper has a cool experiment and figure showing that doubling dimension is not just an artifact of analysis and its impact can actually be measured in practice.
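The experimental recipe is simple to reproduce: randomly project the points (JL-style) to a small target dimension and measure how well the MST cost is preserved as that dimension shrinks. A rough sketch of that pipeline, using off-the-shelf SciPy routines rather than our experimental code:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_cost_after_projection(X, target_dim, seed=0):
    """Project the points with a random Gaussian (JL-style) map and return the
    MST cost in the projected space. Comparing this to the MST cost in the
    original space, as target_dim varies, is the kind of experiment described
    above (assumes distinct points so all pairwise distances are positive)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    G = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
    Y = X @ G                                   # projected points
    dist = squareform(pdist(Y))                 # pairwise distance matrix
    return minimum_spanning_tree(dist).sum()    # total MST weight
```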
Learning-based Support Estimation in Sublinear Time
Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner
Spotlight Presentation at ICLR 2021 [pdf] [story] [video] [poster]
We consider the sample complexity of estimating the support size of a discrete distribution supported on a subset of a possibly large domain $[n]$. Without any advice, the sample complexity is $O(n/\log n)$, barely sublinear. In our model, we additionally assume access to approximate probabilities of each sampled element as advice. With this augmentation, the sample complexity drops significantly to $O(n^c)$ for $c < 1$. Our experiments were based on internet search queries, for which we could learn approximate underlying probabilities from historical data.
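As a toy illustration of why advice helps (emphatically not the paper's estimator): if we knew each sampled element's probability exactly, an inverse-probability-weighted count of the distinct observed elements would be an unbiased estimate of the support size. The paper's algorithms work with approximate probabilities and come with actual guarantees.

```python
def toy_support_estimate(samples, p_hat):
    """Toy inverse-probability estimator (NOT the paper's algorithm).
    Each distinct observed element x is weighted by 1 / Pr[x appears at least
    once among m samples], computed from its predicted probability p_hat(x).
    With exact probabilities this is an unbiased estimate of the support size."""
    m = len(samples)
    return sum(1.0 / (1.0 - (1.0 - p_hat(x)) ** m) for x in set(samples))
```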
A TEDxBoston talk about being cautious of AI hype.
A Forbes article about my work and the accompanying talk, given at CSAIL + Imagination in Action: AI Frontier & Implications event in celebration of CSAIL's 60th anniversary.