Improved Frequency Estimation Algorithms with and without Predictions
Anders Aamand, Justin Y. Chen, Huy Nguyen, Sandeep Silwal, Ali Vakilian
Spotlight Presentation at NeurIPS 2023 [pdf] [story] [video]
We consider the fundamental problem of estimating frequencies of items arriving in a stream, given access to an oracle which predicts "heavy-hitters". Prior work gave theoretical guarantees for this setting under the further (natural) assumption that the underlying frequencies follow a Zipfian pattern. We show that under the Zipfian assumption, a much better frequency estimation algorithm already exists without the use of any predictions. Additionally incorporating predictions further improves its performance, as shown in our experiments.
Learning-Augmented Algorithms for Online Linear and Semidefinite Programming
Elena Grigorescu, Young-San Lin, Sandeep Silwal, Maoyuan Song, Samson Zhou
NeurIPS 2022 [pdf]
Faster Fundamental Graph Algorithms via Learned Predictions
Justin Y. Chen, Sandeep Silwal, Ali Vakilian, Fred Zhang
ICML 2022 [pdf] [story] [video]
We extend the work of Dinitz et al. (NeurIPS, 2021) and study graph algorithms with the help of learned LP variables. We obtain a better bound for bipartite matching by taking the Hungarian method inspired algorithm of Dinitz et al. and incorporating a a flow perspective. We also give a learned variant of the classic Goldberg's algorithm for shortest-paths with negative edges. Interestingly, the vertex potentials that are often used in shortest-path algorithms are also the variables of an appropriate LP. Lastly, we obtain a host of other learning-based results by utilizing standard (efficient) reductions between graph problems. An interesting development was that for the matching case, the challenge was to utilize a (feasible) set of dual predictions whereas for shortest-paths, the challenge was to round a given set of predictions to first be feasible. After that, optimizing is easy: just run Dijkstra's! Our experiments were also on an interesting dataset: graphs where nodes represent countries and edges are currency exchange rates. By taking the logs of the rates, finding the most efficient way to transfer one currency to another is a cool textbook example of shortest-paths in action.
Triangle and Four Cycle Counting with Predictions in Graph Stream
Justin Y. Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David Woodruff, Michael Zhang
ICLR 2022 [pdf] [story] [slides] [code]
Counting triangles in a graph stream is hard: we might accidentally miss 'important' edges in the stream which contribute a lot to the global triangle count. This is where a predictor comes in: if we are given hints on which edges are important, the problem can be mitigated by simply keeping them, leading to improved space bounds. While simple, the observation is quite natural. In practice, there are graph datasets which are slow varying across time (think social network for ex). Theoretically, it turns out many prior algorithms can be reframed in the viewpoint of having access to a predictor.
See the supplementary material here.
Learning-Augmented $k$-means Clustering
Jon Ergun, Zhili Feng, Sandeep Silwal, David Woodruff, Samson Zhou
Spotlight Presentation at ICLR 2022 [pdf] [story] [slides]
We consider the problem of $k$-means clustering with the help of a predictor which outputs 'noisy' labels for each point. If the predictor's hints are based on a sufficiently accurate clustering, we can obtain a clustering which can be arbitrarily close to optimum, thereby surpassing known approximation lower bounds (without predictions). Interestingly, it is not sufficient to just blindly follow the predictor's advice and one must post-process the hints. Our algorithms are inspired from robust mean estimation. The fact that we can overcome worst-case NP-hardness via learning is very cool to me and is something that should be explored more.
Learning-based Support Estimation in Sublinear Time
Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner
Spotlight Presentation at ICLR 2021 [pdf] [story] [video] [poster]
We consider the sample complexity of estimating the support size of a discrete distribution drawn from a subset of a possibly large domain $[n]$. Without any advice, the sample complexity is $O(n/\log n)$, barely sublinear. In our model, we assume approximate probabilities of each sample as advice. With this augmentation, the sample complexity drops significantly to $O(n^c)$ for $c < 1$. Our experiments were based on internet search queries for which we could learn approximate underlying probabilities from historical data.
A Bi-metric Framework for Fast Similarity Search
Haike Xu*, Sandeep Silwal, Piotr Indyk
Preprint [pdf] [story] [video]
We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a *ground-truth* metric that is accurate but expensive to compute (e.g., embeddings from a large or expensive neural network to compare two sentences), and a *proxy* metric that is cheaper but less accurate (e.g., embeddings from a very small or local network). In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only using a limited number of calls to both metrics. In many cases, we can beat the popular "re-ranking" baseline through the use of some pretty cool graph-based data-structures!
Efficiently Computing Similarities to Private Datasets
Arturs Backurs, Zinan Lin, Sepideh Mahabadi, Sandeep Silwal, Jakub Tarnawski
ICLR 2024 [pdf] [story] [video]
Imagine a hospital which has a (private) dataset of patients with a particular ailment. We would like to measure the similarity between this dataset and any new patient in hopes of diagnosing them. The paper abstracts this natural problem as follows: given a datset $X \subset \mathbb{R}^d$, output a private datastructure $D$. On any query vector $y$, $D(y)$ computes $\sum_{x \in X} f(x,y)$ where $f$ is a desired similarity function (e.g. distance function or kernel function). We want $D$ itself to be private so it can handle an arbitrary number of queries (for ex, image if $D$ always outputs $42$. This is private but not very accurate so we want to do better). The paper explores several natural choices of $f$ and shows how to exploit the "geometry" of $f$ to create $D$ with low error.
A Near-Linear Time Algorithm for the Chamfer Distance
Ainesh Bakshi, Piotr Indyk, Rajesh Jayaram, Sandeep Silwal, Erik Waingarten
NeurIPS 2023 [pdf] [story] [video]
A popular measure of similarity between two large scale high-dimensional datasets is the optimal transport (or Earth Movers) distance. Unfortunately, this is a computationally expensive measure to compute. In this paper, we consider fast algorithms for computing the Chamfer distance, a relaxation of optimal transport. It is defined as the sum of nearest neighbor distances from every point in one dataset to the other. We give a linear time algorithm for computing this value and show that interestingly, outputting the underlying mapping (as opposed to just the value) between points that defines the Chamfer distance (conditionally) requires quadratic time.
Data Structures for Density Estimation
Anders Aamand, Alexandr Andoni, Justin Y Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal
ICML 2023 [pdf] [story] [slides] [video]
We consider the following problem: we are given knowledge of $k$ discrete distributions $v_i$ for $1 \le i \le k$ over the domain $[n] = \{1, \cdots ,n\}$ which we can preprocess. Then we get samples from an unknown discrete distribution $p$, also over $[n]$. The goal is to output the closest distribution to $p$ among the $v_i$’s in TV distance (up to some small additive error). The state of the art algorithms require $\Theta(\log k)$ samples and run in near linear time. We introduce a fresh perspective on the problem and ask if we can output the closest distribution in sublinear time. This question is particularly motivated as the problem is a generalization of the traditional nearest neighbor search problem: if we take enough samples, we can learn $p$ explicitly up to low TV distance and then find the closest $v_i$ in $o(k)$ time using standard nearest neighbor search algorithms. However, this approach requires $\Omega(n)$ samples. Thus, it is natural to ask: can we obtain both sublinear number of samples and sublinear query time? We present some nice progress towards this question.
KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals
Sandeep Silwal*, Sara Ahmadian, Andrew Nystrom, Andrew McCallum, Deepak Ramachandran, Seyed Mehran Kazemi
ICLR 2023 [pdf] [story] [slides]
The work explores if we can efficiently query large ML models, more precisely large language models, to solve concrete downstream tasks. We focus on clustering: we are given a set of $n$ sentences of text documents and query access to a large LLM which provides high-quality pairwise similarity scores for every pair of sentences of texts. Ideally, we would query the model on all $\Theta(n^2)$ pairs to obtain a high-quality similarity matrix, which can be used for many types of clustering algorithms downstream. However, this is not feasible in practice due to the prohibitive resource costs required to make a large number of queries. For ex, even if $n = 10^6$, finding the 'ground truth' similarities as computed by the model requires on the order of $10^{12}$ queries! We solve this challenging problem via an algorithmic lens. First we narrow down to the concrete clustering problem of correlation clustering. Then, we use the fact that often times, very efficient but lower quality hints/predictions are available for text similarity. Using the idiosyncrasies of correlation clustering coupled with readily available hints, we are able to query the model a very limited number of times while still obtaining a clustering comparable to the hypothetical case where querying all pairs is feasible. We also theoretically model the setting and show that for correlation clustering, a linear number of queries is sufficient to obtain close to the optimal clustering quality.
Sub-quadratic Algorithms for Kernel Matrices via Kernel Density Estimation
Ainesh Bakshi, Praneeth Kacham, Piotr Indyk, Sandeep Silwal, Samson Zhou
Spotlight Presentation at ICLR 2023 [pdf] [story] [video and slides]
We noticed there were many works on designing awesome datastructures for kernel density estimation: they preprocess a dataset and output a datastructure which approximates the kernel density value at any future query in sublinear time. However, there were not many works on using these powerful datastructures in downstream applications. In this work, we show how to use these datastructures to perform fast sampling on kernel graphs: weighted graphs whose edges are the pairwise kernel values between datapoints. Our sampling algorithms are simple and lead to many implications for eigenvector estimation, low-rank approximation, and clustering to name a few.
Faster Linear Algebra for Distance Matrices
Piotr Indyk, Sandeep Silwal
Oral Presentation at NeurIPS 2022 [pdf] [story] [slides] [video]
A distance matrix is defined as the matrix $A$ which records pairwise distances between $n$ data points in $\mathbb{R}^d$. They are ubiquitous in ML and data science as they capture similarity matrices and have applications in non-linear dimensionality reduction, visualization, clustering, etc. However, they require $\Omega(n^2)$ time and space for a dataset with $n$ points, making them quite difficult to compute on. We fill this gap and explore faster and scalable algorithms for distance matrices. There are three gems in the paper. (1) For the $\ell_1$ distance metric, we can compute $Ay$ for any vector $y$ in $O(nd)$ time, which is sublinear in the matrix size! Many foundational linear algebra algorithms (power method, iterative methods etc.) use matrix-vector products as the fundamental unit so $Ay$ the 'right' algorithmic primitive to consider. (2) The natural follow up question to (1) is if all metrics admit such fast matrix-vector speedups. This is not the case! We show that for the $\ell_{\infty}$ metric, compute $Ay$ for general $y$ requires $\Omega(n^2)$ time, assuming standard hardness assumptions. (3) Lastly, we consider the question of computing an approximate distance matrix, for example for the $\ell_2$ metric. The most natural way to do this, if one is familiar with some algorithmic tools, is to use the Johnson-Lindenstrauss (JL) lemma: simply project your points onto $O(\log n)$ dimensions and compute the distance matrix in the projected space, which requies $\Theta(n^2 \log n)$ time. Computing distances is in fact one of the most common useage of the JL lemma. We show that one can actually go beyond the JL lemma and comptue an approximate distance matrix in $O(n^2 \text{poly}(\log \log n))$ time! This is the result one would get if we could project onto $O(\text{poly}(\log \log n))$ dimensions using the JL lemma, but this is false as the JL lemma is tight! We use some nice metric compression results to obtian this result.
Differentially Private Gomory-Hu Trees
Anders Aamand, Justin Y. Chen, Mina Dalirrooyfard, Slobodan Mitrović, Yuriy Nevmyvaka, Sandeep Silwal, Yinzhan Xu
Preprint [pdf] [story]
Computing various notions of graph cuts under differential privacy (DP) has been extensively studied. Our paper fills in the following gap in this literature: ignoring privacy parameters, computing a global min-cut only incurs $\tilde{O}(1)$ error for pure DP. Computing all possible cuts incurs $\tilde{O}(n^{3/2}$ error, and computing a single min-cut incurs $\tilde{O}(n)$ error. Thus, it is natural to ask what happens if we want to compute all pairs min-cut. We show that privately reporting all pair min-cuts only incurs $\tilde{O}(n)$ error, which aligns with the single minimum-cut case. Along the way, we also release a private Gomory-Hu tree, which neatly captures all minimum cuts in a tree structure. An interesting open question raised by our work is if $\tilde{O}(n)$ error can be improved on if we just want all pair min-cut values, rather than the cuts themselves.
Learned Interpolation for Better Streaming Quantile Approximation with Worst Case Guarantees
Nicholas Schiefer*, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal, Tal Wagner
ACDA 2023 [pdf] [story]
We improve the empirical performance of streaming quantile estimation algorithms while retaining worst-case guarantees. The main (and simple) idea is to approximate the quantiles in a stream via linear functions. This is motivated by the fact that many CDFs in the real world are quite smooth, meaning a linear interpolation obtians better approximation compared to step functions, which the state of the art theoretical algorithms use. While the idea is simple, we had a challenging time inacting it in practice to ensure both practical efficiency and theoretical gurantees. We ultimately suceeded by considering an algorithm which starts off as the standard theoretical algorithms, but switches to a linear interplation at an intermediate step, obtaining 'the best of both worlds'.
The White-Box Adversarial Data Stream Model
Miklos Ajtai, Vladimir Braverman, T.S. Jayram, Sandeep Silwal, Alec Sun, David P. Woodruff, Samson Zhou
PODS 2022 [pdf] [story] [slides]
Streaming algorithms in the presence of an adaptive adversary has been well studied recently. Most papers are in the so called 'black-box' model where the adversary can only observe outputs of the algorithm to design future inputs. Motivated by real-world adversarial attacks, we consider a "white-box" model where the adversary can now also observe the internal randomness used by the algorithm. Unsurprisingly, many tasks become impossible without storing the entire stream. However, one can still obtain non-trivial algorithms. One example is if we assume the adversary is computationally bounded, then we can use the 'shortest integer solution' (SIS) problem to design streaming algorithms: the SIS matrix is used as a sketching matrix. Even if the adversary knows this matrix, it cannot find integer vectors in the kernel which is a property we exploit. More broadly, we employ collision-resistant hash functions as well as the classic Morris counter to design white-box robust algorithms.
Adversarial Robustness of Streaming Algorithms through Importance Sampling
Vladimir Braverman, Avinatan Hassidim, Yossi Matias, Mariano Schain, Sandeep Silwal, Samson Zhou
Silver Best Paper Award at Adversarial ML Workshop at ICML 2021; Final version at NeurIPS 2021
[pdf] [story] [video] [slides]
Adversarial robustness has garnered much recent interest in both the ML and TCS communities. Streaming algorithms are a good intermediate step: they have ML applications while being amenable to theoretical study. The adversary can adaptively design future stream elements after observing the algorithm's response to past inputs. Our simple finding is that existing streaming algorithms, e.g. those based on coresets and sampling based methods, are already robust to the stated model of adversaries. This has wide applications such as in numerical linear algebra and clustering. I believe our experiments are also interesting: they highlight the vulnerability of streaming algorithms implemented in standard and widely used software libraries such as Apache Spark.
Property Testing of LP-Type Problems
Rogers Epstein, Sandeep Silwal
ICALP 2020 [pdf] [story] [video]
We must determine if a constrained optimization problem is feasible or `far' from feasible while having random access to few constraints of the problem. A cool example is determining if a set of high-dimensional points, which are colored red and blue, are linearly seperable, or if a constant fraction of points need to be removed for the set to be seperable. Here the constraints correspond to having random access to the points. We show that for a natural class of optimization problems, called LP-Type problems, one only needs random access to a number of constraints which is independent of the size of the problem (and only depends on the dimension). The key idea underlying our analysis is the fact that for LP-type problems, which generalize LPs, few constraints already determine feasibility (for ex. consider the 'basis' of a LP). Such ideas were already used in prior works such as Clarkson's classic randomized algorithm for LPs and we port them over to property testing.
Testing Properties of Multiple Distributions with Few Samples
Maryam Aliakbarpour, Sandeep Silwal
ITCS 2020 [pdf] [story] [video]
We consider a noisy version of distribution testing where each sample can be drawn from a possibly different distribution. While this is hopeless in general, we study natural models of similarity across distributions where one can obtain meaningful results.
See the first 20 mins. here
Dimensionality Reduction for Wasserstein Barycenter
Zachary Izzo, Sandeep Silwal, Samson Zhou
NeurIPS 2021 [pdf] [story] [video]
We obtain dimensionality reduction bounds for the Wasserstein barycenter problem which go beyond the classical Johnson-Lindenstrauss (JL) lemma. The key idea is to realte the problem to (constrained) $k$-means clustering for which such bounds have been known, e.g., see Makarychev et al. (STOC 2019). Surprisingly, one cannot improve upon the JL lemma if we consider the related problem of optimal transport between two discrete distributions.
Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering
Shyam Narayanan*, Sandeep Silwal*, Piotr Indyk, Or Zamir
ICML 2021 [pdf] [story] [video]
We obtain dimensionality reduction bounds for facility location clustering and the minimum spanning tree problem which go beyond the classical Johnson-Lindenstrauss (JL) lemma. For context, the work of Makarychev et al. (STOC 2019) showed that for $k$-means clustering, one can obtain a dimension bound in terms of $\log k$, rather than $\log n$ where $n$ is the number of points. In the problems we consider, there is no parameter "$k$" to parameterize the dimension bound. Instead we use the doubling dimension, which is a measure of intrinsic dimensionality of the dataset: it is at most $O(\log n)$ in the worst-case (i.e. JL bound) but can also be $O(1)$ regardless of $n$. To obtain our bounds, we exploit 'local' properties of our problems inspired by existing algorithms. The paper has a cool experiment and figure showing that doubling dimension is not just an artifact of analysis and its impact can actually be measured in practice.
Using Dimensionality Reduction to Optimize t-SNE
Rikhav Shah, Sandeep Silwal
OPTML Workshop at NeurIPS 2019 [pdf] [story] [poster]
t-SNE is a popular visualization tool which maps high dimensional data into two or three dimensions. Unfortunately it is also extremely slow. We propose a somewhat counterintuitive strategy to mitigate this: doing a dimensionality reduction (such as PCA) before applying t-SNE (which itself is dimensionality reduction). Qualitatively this does not seem to make a difference to the t-SNE output while making t-SNE up to an order of magnitude faster.
Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural Networks
Anders Aamand, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Nicholas Schiefer, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner
NeurIPS 2022 [pdf][story] [slides]
The Weisfeiler-Lehman (WL) test is a popular heuristic for graph isomorphism. Its claim to fame in the GNN community is the theoretical characterization of Xu et al. which states that message passing GNNs can simulate the WL test and furthermore, these GNNs can only distinguish graphs as well as the WL test. Such works provide a nice qualitative understanding of the expressibility powers of GNNs, but leave opon the question of quantitative understanding which our works fills. We provide an almost tight understanding of the sizes for GNNs (in terms of the number of ReLu units and message size used) to simulte the WL test. The main highlight result is that one can use an almost logarithmic (in the size of the input graph) number of ReLu units and message size to perform the simulation, which provides an exponential improvement over prior works. Along the way we construct a nice primitive for performing hashing using GNNs.
Hardness and Algorithms for Robust and Sparse Optimization
Eric Price, Sandeep Silwal, Samson Zhou
ICML 2022 [pdf] [story] [slides]
Optimization problems where one requires sparsity of variables, such as sparse regression or sparse PCA, are an important and well-studied class of problems. Many of these problems are known to be hard in the worst case, and the best theoretical algorithms have an exponential dependence on $k$, the sparsity parameter. We consider a natural version where we are promised that an optimal solution exists. Given this, we design algorithms for such problems with a reduced exponential dependence of $k/2$, all using the same underlying idea. To illustrate this, suppose we want to solve $Ax=b$ where we are promised a $k$-sparse solution $x$ exists. We first guess over all $k/2$ sparse vectors. We then "match" a fixed guess $y$ to the "correct" other $k/2$ sparse vector using fast nearest neighbor search data structures on the query $b-Ay$. This is the classical algorithmic idea of "meet in the middle." We also study (fine-grained) hardness of various formulations of sparse optimization problems and give reductions between sparse and robust regression, filling a gap in the literature.
Constant Approximation for Individual Preference Stable Clustering
Anders Aamand, Justin Y. Chen, Allen Liu, Sandeep Silwal, Pattara Sukprasert, Ali Vakilian, Fred Zhang
Spotlight Presentation at NeurIPS 2023 [pdf] [video]
Robust Algorithms on Adaptive Inputs from Bounded Adversaries
Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, Fred Zhang, Qiuyi Zhang, Samson Zhou
ICLR 2023 [pdf] [story] [poster]
Inspired by related works in streaming, we use differentiall privacy to design adversarially robust algorithms in many offline settings. The cool observation used is that we can view the internal randomness of randomized algorithms as the 'private' dataset which we wish to protect from adversaries via differential privacy.
Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time
Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, Samson Zhou
SODA 2023 [pdf]
Motif Cut Sparsifiers
Michael Kapralov, Mikhail Makarov, Sandeep Silwal, Christian Sohler, Jakab Tardos
FOCS 2022 [pdf] [story]
Much work has gone into sparsifying dense graphs to preserve the size of cuts. In this work, we ask a new question: can we sparsify graphs to preserve triangle counts, or in general counts of small subgraphs, across cuts?
Smoothed Analysis of the Condition Number Under Low-Rank Perturbations
Rikhav Shah, Sandeep Silwal
RANDOM 2021 [pdf] [story] [video]
Most works in smoothed analysis assume dense entry-wise perturbation of the input variables. We take a step back and consider a new model of low-rank perturbations. Our main result is that adding a random rank $k$ matrix to a fixed matrix of rank at-least $n-k$ produces a well-conditioned matrix.
A Concentration Inequality for the Facility Location Problem
Sandeep Silwal
Operations Research Letters, Volume 50 [pdf] [story]
How well does the objective value of a random instance of the (uniform) facility location problem concentrate? The answer is quite surprising. In the unit square with $n$ uniform points, the interval of concentration is at most $O(n^{1/6})$, an odd exponent. The argument requires Talagrand's concentration inequality which heavily exploits the 'local' structure of the facility location problem.
Improved Space Bounds for Learning with Experts
Anders Aamand, Justin Y. Chen, Huy Lê Nguyễn, Sandeep Silwal
ACDA 2023 (Poster) [pdf]
A note on the universality of ESDs of inhomogeneous random matrices
Vishesh Jain, Sandeep Silwal
Latin American Journal of Probability and Mathematical Statistics [pdf]
Directed Random Geometric Graphs
Jesse Michel, Ramis Movassagh, Sushruth Reddy, Rikhav Shah, Sandeep Silwal
Journal of Complex Networks [pdf] [story]
A novel model of random digraphs for real-world directed networks inspired by the classical random geometric graph model. Interestingly, the in-degree and out-degree distributions follow different laws (power-law vs binomial).