Publications of Edith Cohen

Papers and presentations sorted by topic.

(provided in postscript/compressed postcript/pdf/HTML/ppt/pdf formats)

Copyright Notice: Since most of these papers are published, the copyright has been transferred to the respective publishers. Therefore, the papers cannot be duplicated for commercial purposes. The following is ACM's copyright notice; other publishers have similar ones.

Copyright © 199x by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that new copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.

List of Topics: Items that fit in multiple categories are listed under the most relevant one.

Random samples as synopses of datasets

Random samples are flexible general-purpose synopses of data sets that are too massive to store in full, manipulate, or transmit. Answers to queries posed over the original data can be quickly estimated from the sample, and the same sample can be used for many types of queries. My work on sampling aims to facilitate a more effective use of sampling as synopses by designing sampling schemes and estimators for fundamental classes of queries.

Scalable design of sampling algorithms should depend on how the data is presented, to avoid the time and cost of aggregation. We distinguish between data sets presented as key value pairs (which can be streamed or distributed), data sets where there is a vector of values associated with each key and these values are dispersed across servers or time, and finally unaggregated data sets, where the data contains multiple elements with the same key and the value of the key is the sum of element weights.

The value of our sample hinges on the performance of our estimators. The sampling scheme should be geared to the statistics (queries) we would like supported and estimators should optimally use the information in the sample. I am also interested in understanding the limits of a sampling scheme: Characterize the queries we can estimate well and provide optimal estimators for a given query.

Sampling schemes for data presented as key value pairs

Multi-objective sampling

Common queries over key-value data sets are segment f-statistics, which are the sum over keys in a selected segment of a function f applied to the value. This formulation includes segment sum, counts, moments, thresholds, and capping statistics. A weighted sample computed with respect to a particular f provides estimaes with statistical guarantees for f-statistics. This is achieved by including each key with probability roughly proportional to f applied to its value. But estimation quality of g-statistics from a weighted sample computed with respect to f deteriorates when g is very different than f (for example, when f and g are sum and count, or are threshold or capping functions with very different parameters.) We study here multi-objective samples which provide estimates with statistical guarantees on quality for a set of different functions. In particular, we show how to construct efficiently a small sample which provides estimates with statistical guarantees for all monotone non-decreasing functions f.

Structure-aware sampling

Data is typically structured and queries conform to the structure. Sampling schemes, however, are oblivious to the structure. We propose sampling schemes that are structure-aware and allow for more accurate approximate range queries while retaining the benefits of sample-based synopses. Part of the work builds on the flexibility within the VarOpt family of distributions to gain structure-awareness.

VarOpt sampling

With classic Poisson Probability Proportional to Size (PPS) samples, the inclusion probability of each key is proportional to its weight and inclusions of different keys are independent. VarOpt samples have distinct advantages over Poisson sampling. They have PPS inclusion probabilities but also maintain a fixed-size sample and variance-optimal subset-sum estimators. We present efficient VarOpt stream-sampling algorithms. We also characterize a VarOpt family of sampling distribution, showing they all satisfy some important properties including Chernoff-Hoeffding bounds.

Bottom-k (order) sampling

Bottom-k sampling (in Statistics literature also known as order sampling) is performed by assigning keys random ranks that may depend on their value and including the k keys with minimum rank. By appropriately choosing rank distributions, bottom-k includes successive weighted sampling without replacement and Sequential Poisson (Priority) sampling. Bottom-k samples, like Poisson samples, (and as it seems, unlike VarOpt) can be coordinated . The sample has fixed size k, which is an advantage over Poisson samples and over with-replacement samples. We also explore the application of bottom-k sampling to the analysis of large graphs, expanding on my size-estimation work. We study data structures, sampling algorithms, and subset-sum estimators for bottom-k samples.

Sampling dispersed data: Coordinated sampling, Monotone sampling, Estimation

Data sets, such as request and traffic logs and sensor measurements, are repeatedly collected in multiple instances: time periods, locations, or snapshots. The data can be viewed as a matrix, with rows corresponding to different keys (users, sensors, cookies) and columns to different instances. Each entry is the value of a key in a given instance. Queries which span multiple instances, such as distinct counts and distance measures are used for planning, management, and anomaly or change detection. To scalably summarize such data, the sampling scheme of one instance should not depend on values in other instances. We would like, however, to use the same sample to estimate different statistics, which may depend on multiple instances. Since we are estimating a function from samples which provide only partial information on the set of values, the estimation problem is challenging. We study sampling schemes and estimation when instances are sampled independently and when the samples are coordinated.

Coordinated sampling is a method of sampling different instances so that the sampling of each instance is classic Poisson or bottom-k but samples of different instances are more similar when the instances are more similar. Statisticians proposed and used coordination as a method of controlling overlap in repeated surveys. Computer Scientists, myself included, (re-)discovered coordination for data analysis from samples: aggregates such as distinct counts and Jaccard Similarity can be better estimated when samples are coordinated. Coordination generalizes the technique popularly known as MinHash sketches. Sample coordination is a form of Locality Sensitive Hashing (LSH). I first came across coordination when observing that coordinated samples of all neighborhoods (or reachability sets) in a graph can be computed much more efficiently than independent samples. With time, I recognized the potential of coordination for large scale data analysis and opted to better understand it. We presented tighter estimators for basic queries, and more recently (see ), proposed a model of monotone sampling, which generalizes coordinated sampling, and provided a full characterization for the queries we can successfully answer along with efficient and practical optimal estimators.

In particular, since in general there may not exist one estimator with minimum variance for all possible data, we seek admissibility (Pareto variance optimality), meaning that estimators can not be improved for one data set in the domain without increasing variance for another. We also propose the stricter order-based optimality, where an order is specified over the data domain and we look for estimators which can not be improved for one data without increasing variance for preceding data.

Sampling unaggregated data: Streamed and distributed

Many data sets occur in an unaggregated form, where multiple data points are associated with each key. In the aggregated view of the data, the weight of a key is the sum of the weights of data points associated with the key and queries (such as heavy hitters, sum statistics, distinct counts) are posed over the aggregate view. Examples of unaggreagated data sets are IP packet streams where keys are flow keys, individual requests to resources where key is resource ID, interactions of users with Web services, where keys are cookies, and distributed streams of events registered by sensor networks, where keys are event types. Since data points are scattered in time or across multiple servers, aggregation is subject to resource limitation on storage and transmission. Therefore, we aim for efficient processing of queries and computing sample-based summaries directly over the unaggregated data. We consider both a model where the algorithm can "see" the complete data set (stream) and another model, where the data is subjected to per-entry fixed-rate sampling before it is made accessible to the summarization algorithm. This second model is motivated by sampled NetFlow, which was deployed in high speed IP routers. Also belongs in this category (but listed elsewhere) is the application to approximate distinct counting of my HIP estimators.

More with sampling

Algorithmic Game Theory: Envy Freeness

WWW performance

Freshness/Aging issues in the performance of Web caches

Improved Web user-perceived latency through reduced communication-establishment overhead

Improved Web performance by enhanced end-to-end communication

Cache replacement for Web pages: theory and experiments

Connection Caching: theory and experiments

Networking: Routing, Path selection, Measurements, Packet filtering

Content Distribution

Peer-to-Peer networks; replication and search in distributed datasets

Data: XML, Data streams, Spatial aggregation

Data Mining / Machine Learning

Algorithms for graphs and networks

Scalable Analysis of Massive Graphs

Graphs with billions of edges, such as social and Web graphs, are common and are a powerful source of information. The graph structure, alone or combined with meta-data, a static snapshot or evolving, encodes information on centralities, interests, similarities, influence, and communities. To effectively mine this information, algorithms must be highly scalable and models should accurately capture the real-world properties we are after.

I use a combination of algorithmic techniques, sketching and estimation techniques, modeling, and data sets, to develop effective tools for mining massive graphs.

Size-Estimation Scheme and Applications

My size-estimation scheme is based on a simple but powerful technique: the least-element (LE) method. This method is better known today as min-hash: Suppose you have subset from a universe of items. You (implicitly) assign a random permutation (or random values) to all items. We refer to the permutation rank or the random value assigned to an item as its rank. Now, the LE of a subset is the minimum rank of its members. The LE is in expectation smaller when there are more elements in the set. Therefore, it is possible to estimate the size of the set from its LE rank. When LE ranks of different subsets are obtained using the same random permutation, they are coordinated. Coordination facilitates many applications. In particular, it means that the LE ranks are mergeable: The LE of the union of subsets is the minimum of the LEs of its members. The LE ranks can also be used to estimate similarity of sets. For example, the probability that two sets have the same LE is proportional to their Jaccard similarity, which is the ratio of the intersection size to the union size. Therefore, the Jaccard similarity of the sets can be estimated from the LE similarity.

The accuracy of these size or similarity estimates can be enhanced by repeating this k times (using k permutations), by taking the k smallest ranks in a single permutation, or by hashing elements randomly to k buckets and taking the LE of each of the k buckets. I refer to these sketch flavors, respectively, as k-mins, bottom-k, or k-partition sketches. These three flavors have different advantages, depending on the tradeoffs we want to achieve. Asymptotically, the estimators with all three have the same behavior but bottom-k sketches carry the most information. In this early work (1994-1997), I studied mostly k-mins estimators (with mention of bottom-k sketches) and studied cardinality and similarity estimation.

The first application of the LE technique for size estimation I am aware of is for approximate distinct counting [Flajolet and Martin 1985]. As for similarity estimation, LE sketches can be viewed as a special case of coordinated samples [Brewer, Early, Joice 1972]. A very well known application of the LE technique is for near-duplicate detection of Web pages [Broder 1997].

I first applied the LE technique in a graph setting (see below): It turned out that (coordinated) sketches of all neighborhoods, and all reachability sets of all nodes in a graph can be very efficiently computed, with processing that is nearly-linear in graph size. This resulted in a powerful technique for analyzing very large graphs. Other early applications I explored (see below) are determining the optimal order to perform a chain multiplication of sparse matrices and sketch-based 1 tracking of the state of distributed processes for roll-back recovery.

In later papers, I use the term All-Distances sketch for the LE values with respect to all neighborhoods ("balls") of a node in the graph. We consider spatially-decaying agrregation , neighborhood variances , use of bottom-k sketches, and deriving even better estimators.

Distance and Reachability Labels

Shortest Paths: Trading Accuracy and Time

Parallel Algorithms: Shortest Paths and Maximum Flow

These algorithms, which I developed in the early 1990's, consider the idealized PRAM model (Parallel Random Access Machine) of parallel computation. The PRAM model assumes we have as many parallel processors as we can use, and considers the tradeoff between processing time and the total "work" we use (which is the product of time and number of processors). The principled exploration of the fundamental chains of dependencies, however, is relevant for the GPUs, multi-core, and map-reduce architecture which dominate massive computation today. Some of the structures I used also turn out to be relevant for the design of sketching, streaming, and dynamic sequential algorithms. Finally, the basic ideas are rather simple, not (only) aiming for improved worst-case bounds but also for simplicity and implementability.

Combinatorial Optimization / Linear Programming


Design of a CDMA decoder

Software tools

Publications in reverse chronological order

Other work (not yet peer reviewed):

Conference Papers:

Ph.D. Thesis:

Book chapters:

Journal Papers:

August, 2015
edith (at) cohenwang (period) com