Are You My Neighbor? Bringing Order to Neighbor Computing Problems.
David C. Anastasiu, Huzefa Rangwala, and Andrea Tagarelli
Date & Time: Sunday, Aug 4, 2019, 8:00 am - 12:00 pm
Location: Summit 6 - Ground Level, Egan
[Slides] [Introduction] [Audience] [Outline] [Tutors] [Citation] [Bibliography] [Pictures]
Slides
- Nearest Neighbor Problems and Data Types
- Neighbors in Genomics, Proteomics, and Bioinformatics
- Advances in Hashing-Based and Other Approximate Search Methods
- Neighbors in Advertising and Recommender Systems
- Advances in Filtering-Based Search Techniques
- Neighbors in Learning and Mining Problems in Graph Data
Introduction
Finding nearest neighbors is an important topic that has attracted much attention over the years and has applications in many fields, such as market basket analysis, plagiarism and anomaly detection, advertising, community detection, ligand-based virtual screening, etc. As data are being easier and easier to collect, finding neighbors has become a potential bottleneck in analysis pipelines. Performing pairwise comparisons given the massive datasets of today is no longer feasible. The high computational complexity of the task has led researchers to develop approximate methods, which find many but not all of the nearest neighbors. Yet, for some types of data, efficient exact solutions have been found by carefully partitioning or filtering the search space in a way that avoids most unnecessary comparisons.
In recent years, there have been several fundamental advances in our ability to efficiently identify appropriate neighbors, especially in non-traditional data, such as graphs or document collections. In this tutorial, we provide an in-depth overview of recent methods for finding (nearest) neighbors, focusing on the intuition behind choices made in the design of those algorithms as well as the utility of the methods in real-world applications. Our tutorial aims to provide a unifying view of "neighbor computing" problems, spanning from numerical data to graph data, from categorical data to sequential data, and related application scenarios. For each type of data, we will review the current state-of-the-art approaches used to identify neighbors and discuss how neighbor search methods are used to solve important problems.
^topAudience
The tutorial will cover technical material related to efficient methods for identifying nearest neighbors in diverse types of data and key applications for these types of methods. The presentation is meant for an audience with intermediate expertise in data mining having some familiarity with data processing techniques, supervised or unsupervised learning, recommender systems, or web and network systems. Formal definitions of the nearest neighbor search and related variants, along with key applications utilizing these methods will be presented in the tutorial.
^topOutline
The tutorial will cover the following topics:
- Nearest Neighbor Problems and Data Types
- Dense, sparse, and asymmetric data
- Bounded nearest neighbor search
- Nearest neighbor graph construction
- Classical approaches and limitations
- Neighbors in Genomics, Proteomics, and Bioinformatics
- Mass spectrometry search
- Microbiome analysis
- Advances in Hashing-Based and Other Approximate Search Methods
- Locality sensitive hashing variants
- Permutation and graph-based search
- Maximum inner product search
- Neighbors in Advertising and Recommender Systems
- Collaborative filtering at scale
- Learning models based on the neighborhood structure
- Advances in Filtering-Based Search Techniques
- Massive search space pruning by partial indexing
- Effective proximity bounds and when they are most useful
- Neighbors in Learning and Mining Problems in Graph Data
- Neighborhood as cluster in a complex network system
- Neighborhood as influence trigger set
Tutors
David C. Anastasiu is an Assistant Professor in the Department of Computer Engineering at San José State University. His research interests fall broadly at the intersection of machine learning, data mining, computational genomics, and high performance computing. Much of his work has been focused on scalable and efficient methods for analyzing sparse data. He has developed serial and parallel methods for identifying near neighbors, methods for searching related biochemical compounds, methods for characterizing how user behavior changes over time, and methods for personalized and collaborative presentation of Web search results. As a result of his algorithmic work in the area of Data Science, Prof. Anastasiu was awarded the competitive Next Generation Data Scientist (NGDS) Award at the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA'2016). His work has been published in many top-tier conferences and journals, and he serves on the program committees of the most prominent IEEE and ACM data science-related conferences. His research is funded by NSF, Intel Labs, Flex, Infoblox, and NVIDIA Corporation. The tutorial will present material from a combination of his own research as it relates to efficient nearest neighbor search methods and applications in traffic analytics and computational genomics.
Huzefa Rangwala is a Professor at the Department of Computer Science and Engineering, George Mason University. He received his Ph.D. in Computer Science from the University of Minnesota in the year 2008. His research interests include machine learning, learning analytics, bioinformatics and high performance computing. He is the recipient of the NSF Early Faculty Career Award in 2013, the 2014 GMU Teaching Excellence Award, the 2014 Mason Emerging Researcher Creator and Scholar Award, the 2013 Volgenau Outstanding Teaching Faculty Award, 2012 Computer Science Department Outstanding Teaching Faculty Award and 2011 Computer Science Department Outstanding Junior Researcher Award. His research is funded by NSF, NIH, NRL, DARPA, USDA and NVIDIA Corporation. The tutorial will present material from a combination of his own research as it relates to multi-instance learning and multi-task learning. He has presented well attended tutorials at SIAM SDM 2016, 2017 and ACM KDD 2017.
Andrea Tagarelli is an associate professor of computer engineering at the University of Calabria, Italy. He obtained his Ph.D. in computer and systems engineering in 2006. His research interests include topics in data/text mining, machine learning, web and network science, information retrieval. He was program co-chair for the 2018 IEEE/ACM ASONAM conference, and co-organizer of workshops and a mini-symposium on clustering and other data-mining topics in premier conferences in the field (ECIR-16, ACM SIGKDD-13, SIAM DM-14, PAKDD-12, ECML-PKDD-11). He also presented well-attended tutorials on user behavior analysis and mining problems in social networks at WIMS-17, ACM UMAP-15, IEEE/ACM ASONAM-15. He is action editor for the Computational Intelligence Journal and associate editor for the Social Network Analysis and Mining Journal.
Citation
Please use the citation below to refer to our work. ^topBibliography
Deep Learning for Matching in Search and Recommendation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18), pages 1365-1368, ACM, 2018. @inproceedings{xu-xu2018, author = {Jun Xu and Xiangnan He and Hang Li}, title = {Deep Learning for Matching in Search and Recommendation}, booktitle = {The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval}, series = {SIGIR '18}, year = {2018}, pages = {1365-1368}, publisher = {ACM}, address = {New York, NY, USA}, location = {Ann Arbor, MI, USA}, } |
|
Big Data and Recommender Systems. Novática: Journal of the Spanish Computer Scientist Association, 2016. @article{anastasiu-anastasiu2016c, author = {David C. Anastasiu and Evangelia Christakopoulou and Shaden Smith and Mohit Sharma and George Karypis}, title = {Big Data and Recommender Systems}, journal = {Novática: Journal of the Spanish Computer Scientist Association}, number = {240}, month = {October}, year = {2016}, } |
|
A user similarity-based Top-N recommendation approach for mobile in-application advertising. Expert Systems with Applications, 111:51-60, 2018. Big Data Analytics for Business Intelligence @article{hu-hu2018, author = {Jinlong Hu and Junjie Liang and Yuezhen Kuang and Vasant Honavar}, title = {A user similarity-based Top-N recommendation approach for mobile in-application advertising}, journal = {Expert Systems with Applications}, volume = {111}, year = {2018}, issn = {0957-4174}, pages = {51-60}, note = {Big Data Analytics for Business Intelligence}, } |
|
Fast open modification spectral library searching through approximate nearest neighbor indexing. bioRxiv, pages 326173, Cold Spring Harbor Laboratory, 2018. @article{bittremieux-bittremieux2018, author = {Wout Bittremieux and Pieter Meysman and William Stafford Noble and Kris Laukens}, title = {Fast open modification spectral library searching through approximate nearest neighbor indexing}, journal = {bioRxiv}, month = {may}, year = {2018}, pages = {326173}, publisher = {Cold Spring Harbor Laboratory}, } |
|
A Comprehensive Survey of Neighborhood-Based Recommendation Methods. In Recommender Systems Handbook, pages 37-76, Springer US, 2015. @inbook{ning-ning2015survey, author = {Xia Ning and Christian Desrosiers and George Karypis}, title = {A Comprehensive Survey of Neighborhood-Based Recommendation Methods}, booktitle = {Recommender Systems Handbook}, year = {2015}, pages = {37-76}, publisher = {Springer US}, address = {Boston, MA}, } |
|
Streaming Similarity Self-join. Proc. VLDB Endow., 9(10):792-803, VLDB Endowment, 2016. @article{morales-morales2016, author = {Gianmarco De Francisci Morales and Aristides Gionis}, title = {Streaming Similarity Self-join}, journal = {Proc. VLDB Endow.}, issue_date = {June 2016}, volume = {9}, number = {10}, year = {2016}, issn = {2150-8097}, pages = {792-803}, publisher = {VLDB Endowment}, } |
|
EmbedJoin: Efficient Edit Similarity Joins via Embeddings. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), pages 585-594, ACM, 2017. @inproceedings{zhang-zhang2017, author = {Haoyu Zhang and Qin Zhang}, title = {EmbedJoin: Efficient Edit Similarity Joins via Embeddings}, booktitle = {Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, series = {KDD '17}, year = {2017}, pages = {585-594}, publisher = {ACM}, address = {New York, NY, USA}, location = {Halifax, NS, Canada}, } |
|
A Greedy Approach for Budgeted Maximum Inner Product Search. In Advances in Neural Information Processing Systems 30 (NIPS'17), pages 5459-5468, 2017. @inproceedings{yu-yu2017, author = {Hsiang Fu Yu and Cho Jui Hsieh and Qi Lei and Inderjit S. Dhillon}, title = {A Greedy Approach for Budgeted Maximum Inner Product Search}, booktitle = {Advances in Neural Information Processing Systems 30}, series = {NIPS'17}, year = {2017}, pages = {5459-5468}, } |
|
Local community detection in multilayer networks. Data Min. Knowl. Discov., 31(5):1444-1479, 2017. @article{interdonato-interdonato2017, author = {Roberto Interdonato and Andrea Tagarelli and Dino Ienco and Arnaud Sallaberry and Pascal Poncelet}, title = {Local community detection in multilayer networks}, journal = {Data Min. Knowl. Discov.}, volume = {31}, number = {5}, year = {2017}, pages = {1444-1479}, } |
|
Pigeonring: A Principle for Faster Thresholded Similarity Search. Proc. VLDB Endow., 12(1):28-42, VLDB Endowment, 2018. @article{qin-qin2018, author = {Jianbin Qin and Chuan Xiao}, title = {Pigeonring: A Principle for Faster Thresholded Similarity Search}, journal = {Proc. VLDB Endow.}, issue_date = {September 2018}, volume = {12}, number = {1}, year = {2018}, issn = {2150-8097}, pages = {28-42}, publisher = {VLDB Endowment}, } |
|
A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen. BMC Bioinformatics, 19(1):475, 2018. @article{catanese-catanese2018, author = {Helen N. Catanese and Kelly A. Brayton and Assefaw H. Gebremedhin}, title = {A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen}, journal = {BMC Bioinformatics}, volume = {19}, number = {1}, month = {Dec}, year = {2018}, issn = {1471-2105}, pages = {475}, } |
|
AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), pages 455-464, ACM, 2017. @inproceedings{tagami-tagami2017, author = {Yukihiro Tagami}, title = {AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification}, booktitle = {Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, series = {KDD '17}, year = {2017}, pages = {455-464}, publisher = {ACM}, address = {New York, NY, USA}, location = {Halifax, NS, Canada}, } |
|
L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds. In 30th IEEE International Conference on Data Engineering (ICDE '14), 2014. @inproceedings{anastasiu-anastasiu2014, author = {David C. Anastasiu and George Karypis}, title = {L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds}, booktitle = {30th IEEE International Conference on Data Engineering}, series = {ICDE '14}, year = {2014}, location = {Chicago, IL, USA}, } |
|
L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning. In 24th ACM International Conference on Information and Knowledge Management (CIKM '15), 2015. @inproceedings{anastasiu-anastasiu2015, author = {David C. Anastasiu and George Karypis}, title = {L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning}, booktitle = {24th ACM International Conference on Information and Knowledge Management}, series = {CIKM '15}, year = {2015}, location = {Melbourne, Australia}, } |
|
Parallel cosine nearest neighbor graph construction. Journal of Parallel and Distributed Computing, 2017. @article{anastasiu-anastasiu-jpdc2017, author = {David C. Anastasiu and George Karypis}, title = {Parallel cosine nearest neighbor graph construction}, journal = {Journal of Parallel and Distributed Computing}, year = {2017}, issn = {0743-7315}, } |
|
Efficient identification of Tanimoto nearest neighbors; All Pairs Similarity Search Using the Extended Jaccard Coefficient. International Journal of Data Science and Analytics, 4(3):153-172, 2017. @article{anastasiu-anastasiu-jdsa2017, author = {David C. Anastasiu and George Karypis}, title = {Efficient identification of Tanimoto nearest neighbors; All Pairs Similarity Search Using the Extended Jaccard Coefficient}, journal = {International Journal of Data Science and Analytics}, volume = {4}, number = {3}, month = {Nov}, year = {2017}, issn = {2364-4168}, pages = {153-172}, } |
|
Topology-Driven Diversity for Targeted Influence Maximization with Application to User Engagement in Social Networks. IEEE Trans. Knowl. Data Eng., 30(12):2421-2434, 2018. @article{cali\`o-calio2018, author = {Antonio Cali\`o and Roberto Interdonato and Chiara Pulice and Andrea Tagarelli}, title = {Topology-Driven Diversity for Targeted Influence Maximization with Application to User Engagement in Social Networks}, journal = {IEEE Trans. Knowl. Data Eng.}, volume = {30}, number = {12}, year = {2018}, pages = {2421-2434}, } |
|
Learning to Rank Social Bots. In Proceedings of the 29th on Hypertext and Social Media, HT 2018, Baltimore, MD, USA, July 09-12, 2018, pages 183-191, 2018. @inproceedings{perna-perna2018, author = {Diego Perna and Andrea Tagarelli}, title = {Learning to Rank Social Bots}, booktitle = {Proceedings of the 29th on Hypertext and Social Media, HT 2018, Baltimore, MD, USA, July 09-12, 2018}, year = {2018}, pages = {183-191}, } |
|
Consensus Community Detection in Multilayer Networks Using Parameter-Free Graph Pruning. In Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part III, pages 193-205, 2018. @inproceedings{mandaglio-mandaglio2018, author = {Domenico Mandaglio and Alessia Amelio and Andrea Tagarelli}, title = {Consensus Community Detection in Multilayer Networks Using Parameter-Free Graph Pruning}, booktitle = {Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part III}, year = {2018}, pages = {193-205}, } |
|
Non-metric Similarity Graphs for Maximum Inner Product Search. In Advances in Neural Information Processing Systems 32 (NIPS'18), 2018. @inproceedings{morozov-Morozov2018NonmetricSG, author = {Stanislav Morozov and Artem Babenko}, title = {Non-metric Similarity Graphs for Maximum Inner Product Search}, booktitle = {Advances in Neural Information Processing Systems 32}, series = {NIPS'18}, year = {2018}, } |
|
Ensemble-based community detection in multilayer networks. Data Min. Knowl. Discov., 31(5):1506-1543, 2017. @article{tagarelli-tagarelli2017, author = {Andrea Tagarelli and Alessia Amelio and Francesco Gullo}, title = {Ensemble-based community detection in multilayer networks}, journal = {Data Min. Knowl. Discov.}, volume = {31}, number = {5}, year = {2017}, pages = {1506-1543}, } |
|
Influence Maximization on Social Graphs: A Survey. IEEE Transactions on Knowledge and Data Engineering, 30(10):1852-1872, 2018. @article{li-li2018influence, author = {Y. Li and J. Fan and Y. Wang and K. Tan}, title = {Influence Maximization on Social Graphs: A Survey}, journal = {IEEE Transactions on Knowledge and Data Engineering}, volume = {30}, number = {10}, month = {Oct}, year = {2018}, issn = {1041-4347}, pages = {1852-1872}, } |
|
Binary Hashing for Approximate Nearest Neighbor Search on Big Data: A Survey. IEEE Access, 6:2039-2054, 2018. @article{cao-cao2018, author = {Y. Cao and H. Qi and W. Zhou and J. Kato and K. Li and X. Liu and J. Gui}, title = {Binary Hashing for Approximate Nearest Neighbor Search on Big Data: A Survey}, journal = {IEEE Access}, volume = {6}, year = {2018}, issn = {2169-3536}, pages = {2039-2054}, } |
|
Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees. In 22nd International Conference on Database Theory (ICDT), March 26-29, 2019. @inproceedings{li-Li2019, author = {Yuliang Li and Jianguo Wang and Benjamin Pullman and Nuno Bandeira and Yannis Papakonstantinou}, title = {Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees}, booktitle = {22nd International Conference on Database Theory (ICDT), March 26-29}, year = {2019}, date = {2019-03-13}, } |
|
A map-reduce framework for clustering metagenomes. In 2013 IEEE International Symposium on Parallel \& Distributed Processing, Workshops and Phd Forum, pages 549-558, 2013. @inproceedings{rasheed-rasheed2013map, author = {Zeehasham Rasheed and Huzefa Rangwala}, title = {A map-reduce framework for clustering metagenomes}, booktitle = {2013 IEEE International Symposium on Parallel \& Distributed Processing, Workshops and Phd Forum}, year = {2013}, pages = {549-558}, } |
|
Machine learning approaches for metagenomics. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 512-515, 2014. @inproceedings{rangwala-rangwala2014machine, author = {Huzefa Rangwala and Anveshi Charuvaka and Zeehasham Rasheed}, title = {Machine learning approaches for metagenomics}, booktitle = {Joint European Conference on Machine Learning and Knowledge Discovery in Databases}, year = {2014}, pages = {512-515}, } |
|
Phenotype Prediction from Metagenomic Data Using Clustering and Assembly with Multiple Instance Learning (CAMIL). IEEE/ACM transactions on computational biology and bioinformatics, IEEE, 2017. @article{rahman-rahman2017phenotype, author = {Mohammad Arifur Rahman and Nathan LaPierre and Huzefa Rangwala}, title = {Phenotype Prediction from Metagenomic Data Using Clustering and Assembly with Multiple Instance Learning (CAMIL)}, journal = {IEEE/ACM transactions on computational biology and bioinformatics}, year = {2017}, publisher = {IEEE}, } |
|
Metagenome sequence clustering with hash-based canopies. Journal of bioinformatics and computational biology, 15(06):1740006, World Scientific Publishing Company, 2017. @article{rahman-rahman2017metagenome, author = {Mohammad Arifur Rahman and Nathan LaPierre and Huzefa Rangwala and Daniel Barbara}, title = {Metagenome sequence clustering with hash-based canopies}, journal = {Journal of bioinformatics and computational biology}, volume = {15}, number = {06}, year = {2017}, pages = {1740006}, publisher = {World Scientific Publishing Company}, } |
|
CAMIL: Clustering and Assembly with Multiple Instance Learning for phenotype prediction. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 33-40, 2016. @inproceedings{lapierre-lapierre2016camil, author = {Nathan LaPierre and Mohammad Arifur Rahman and Huzefa Rangwala}, title = {CAMIL: Clustering and Assembly with Multiple Instance Learning for phenotype prediction}, booktitle = {2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)}, year = {2016}, pages = {33-40}, } |
Pictures
Here are some pictures from the tutorial and KDD'19.