In the time-decay model for data streams, elements of an underlying data set arrive sequentially, with recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline coreset construction and gives a time-decay coreset for polynomial time-decay functions. We also consider the exponential time-decay model for k-median clustering, where we provide a constant-factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(h Δ) + h) points, where h is the half-life of the decay function and Δ is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well.
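To make the ingredients of the exponential-decay result concrete, the following is a minimal sketch that combines exponential time-decay weights (an element that arrived `age` steps ago counts with weight 2^(-age/h), so h is its half-life) with a Meyerson-style online facility location rule. This is only an illustration, not the algorithm from the paper: the function names, the `facility_cost` parameter, and the simplification of evaluating all decay weights at a single reference time (the end of the stream) are assumptions made for the sketch, and it does not attempt to match the O(k log(h Δ) + h) space bound.

```python
import math
import random


def exp_decay_weight(age, half_life):
    """Exponential time-decay weight: an element that arrived `age` steps
    ago counts with weight 2^(-age / half_life)."""
    return 2.0 ** (-age / half_life)


def online_facility_location(stream, facility_cost, half_life):
    """Toy Meyerson-style online facility location on a time-decayed stream.

    `stream` is a list of points (tuples of floats), oldest first.  Each
    point is weighted by its decay factor at the end of the stream.  A point
    opens a new facility with probability min(1, weight * distance /
    facility_cost); otherwise it is served by its nearest open facility.
    """
    facilities = []
    n = len(stream)
    for t, p in enumerate(stream):
        w = exp_decay_weight(n - 1 - t, half_life)   # age measured from the last arrival
        if not facilities:
            facilities.append(p)
            continue
        d = min(math.dist(p, f) for f in facilities)  # distance to nearest facility
        if random.random() < min(1.0, w * d / facility_cost):
            facilities.append(p)
    return facilities


# Example usage on a tiny 2D stream (values chosen arbitrarily):
# online_facility_location([(0, 0), (0.1, 0.2), (5, 5), (5.1, 4.9)],
#                          facility_cost=1.0, half_life=2.0)
```

In a genuine streaming setting the decay weights keep changing as time advances, which is one of the difficulties a time-decay algorithm must handle; the sketch sidesteps this by fixing a single reference time.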
@InProceedings{braverman_et_al:LIPIcs.APPROX-RANDOM.2019.27,
  author    = {Braverman, Vladimir and Lang, Harry and Ullah, Enayat and Zhou, Samson},
  title     = {{Improved Algorithms for Time Decay Streams}},
  booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)},
  pages     = {27:1--27:17},
  series    = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN      = {978-3-95977-125-2},
  ISSN      = {1868-8969},
  year      = {2019},
  volume    = {145},
  editor    = {Achlioptas, Dimitris and V\'{e}gh, L\'{a}szl\'{o} A.},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address   = {Dagstuhl, Germany},
  URL       = {http://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2019.27},
  URN       = {urn:nbn:de:0030-drops-112429},
  doi       = {10.4230/LIPIcs.APPROX-RANDOM.2019.27},
  annote    = {Keywords: Streaming algorithms, approximation algorithms, facility location and clustering}
}