Are You My Neighbor? Bringing Order to Neighbor Computing Problems.

David C. Anastasiu, Huzefa Rangwala, and Andrea Tagarelli

Date & Time: Sunday, Aug 4, 2019, 8:00 am - 12:00 pm

Location: Summit 6 - Ground Level, Egan

[Slides] [Introduction] [Audience] [Outline] [Tutors] [Citation] [Bibliography] [Pictures]

Slides

Introduction

Finding nearest neighbors is an important topic that has attracted much attention over the years and has applications in many fields, such as market basket analysis, plagiarism and anomaly detection, advertising, community detection, ligand-based virtual screening, etc. As data are being easier and easier to collect, finding neighbors has become a potential bottleneck in analysis pipelines. Performing pairwise comparisons given the massive datasets of today is no longer feasible. The high computational complexity of the task has led researchers to develop approximate methods, which find many but not all of the nearest neighbors. Yet, for some types of data, efficient exact solutions have been found by carefully partitioning or filtering the search space in a way that avoids most unnecessary comparisons.

In recent years, there have been several fundamental advances in our ability to efficiently identify appropriate neighbors, especially in non-traditional data, such as graphs or document collections. In this tutorial, we provide an in-depth overview of recent methods for finding (nearest) neighbors, focusing on the intuition behind choices made in the design of those algorithms as well as the utility of the methods in real-world applications. Our tutorial aims to provide a unifying view of "neighbor computing" problems, spanning from numerical data to graph data, from categorical data to sequential data, and related application scenarios. For each type of data, we will review the current state-of-the-art approaches used to identify neighbors and discuss how neighbor search methods are used to solve important problems.

^top

Audience

The tutorial will cover technical material related to efficient methods for identifying nearest neighbors in diverse types of data and key applications for these types of methods. The presentation is meant for an audience with intermediate expertise in data mining having some familiarity with data processing techniques, supervised or unsupervised learning, recommender systems, or web and network systems. Formal definitions of the nearest neighbor search and related variants, along with key applications utilizing these methods will be presented in the tutorial.

^top

Outline

The tutorial will cover the following topics:

Nearest Neighbor Problems and Data Types

Dense, sparse, and asymmetric data
Bounded nearest neighbor search
Nearest neighbor graph construction
Classical approaches and limitations

Neighbors in Genomics, Proteomics, and Bioinformatics

Mass spectrometry search
Microbiome analysis

Advances in Hashing-Based and Other Approximate Search Methods

Locality sensitive hashing variants
Permutation and graph-based search
Maximum inner product search

Neighbors in Advertising and Recommender Systems

Collaborative filtering at scale
Learning models based on the neighborhood structure

Advances in Filtering-Based Search Techniques

Massive search space pruning by partial indexing
Effective proximity bounds and when they are most useful

Neighbors in Learning and Mining Problems in Graph Data

Neighborhood as cluster in a complex network system
Neighborhood as influence trigger set

^top

Tutors

David C. Anastasiu is an Assistant Professor in the Department of Computer Engineering at San José State University. His research interests fall broadly at the intersection of machine learning, data mining, computational genomics, and high performance computing. Much of his work has been focused on scalable and efficient methods for analyzing sparse data. He has developed serial and parallel methods for identifying near neighbors, methods for searching related biochemical compounds, methods for characterizing how user behavior changes over time, and methods for personalized and collaborative presentation of Web search results. As a result of his algorithmic work in the area of Data Science, Prof. Anastasiu was awarded the competitive Next Generation Data Scientist (NGDS) Award at the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA'2016). His work has been published in many top-tier conferences and journals, and he serves on the program committees of the most prominent IEEE and ACM data science-related conferences. His research is funded by NSF, Intel Labs, Flex, Infoblox, and NVIDIA Corporation. The tutorial will present material from a combination of his own research as it relates to efficient nearest neighbor search methods and applications in traffic analytics and computational genomics.

Huzefa Rangwala is a Professor at the Department of Computer Science and Engineering, George Mason University. He received his Ph.D. in Computer Science from the University of Minnesota in the year 2008. His research interests include machine learning, learning analytics, bioinformatics and high performance computing. He is the recipient of the NSF Early Faculty Career Award in 2013, the 2014 GMU Teaching Excellence Award, the 2014 Mason Emerging Researcher Creator and Scholar Award, the 2013 Volgenau Outstanding Teaching Faculty Award, 2012 Computer Science Department Outstanding Teaching Faculty Award and 2011 Computer Science Department Outstanding Junior Researcher Award. His research is funded by NSF, NIH, NRL, DARPA, USDA and NVIDIA Corporation. The tutorial will present material from a combination of his own research as it relates to multi-instance learning and multi-task learning. He has presented well attended tutorials at SIAM SDM 2016, 2017 and ACM KDD 2017.

Andrea Tagarelli is an associate professor of computer engineering at the University of Calabria, Italy. He obtained his Ph.D. in computer and systems engineering in 2006. His research interests include topics in data/text mining, machine learning, web and network science, information retrieval. He was program co-chair for the 2018 IEEE/ACM ASONAM conference, and co-organizer of workshops and a mini-symposium on clustering and other data-mining topics in premier conferences in the field (ECIR-16, ACM SIGKDD-13, SIAM DM-14, PAKDD-12, ECML-PKDD-11). He also presented well-attended tutorials on user behavior analysis and mining problems in social networks at WIMS-17, ACM UMAP-15, IEEE/ACM ASONAM-15. He is action editor for the Computational Intelligence Journal and associate editor for the Social Network Analysis and Mining Journal.

^top

Citation

Please use the citation below to refer to our work.

David C. Anastasiu, Huzefa Rangwala & Andrea Tagarelli. Are You My Neighbor? Bringing Order to Neighbor Computing Problems.In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'19), ACM. Anchorage, AK, USA, 2019-08-04.

@inproceedings{anastasiu-kdd2019,
       author    = {David C. Anastasiu and Huzefa Rangwala and Andrea Tagarelli},
       title     = {Are You My Neighbor? Bringing Order to Neighbor Computing Problems.},
       booktitle = {The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
       series    = {KDD'19},
       year      = {2019},
       date      = {2019-08-04},
       publisher = {ACM},
       address   = {New York, NY, USA},
       location  = {Anchorage, AK, USA},
       note      = {Tutorial presented at KDD'19, Aug 04, 2019.},
    }

^top

Bibliography

	Jun Xu, Xiangnan He, Hang Li. Deep Learning for Matching in Search and Recommendation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18), pages 1365-1368, ACM, 2018. @inproceedings{xu-xu2018, author = {Jun Xu and Xiangnan He and Hang Li}, title = {Deep Learning for Matching in Search and Recommendation}, booktitle = {The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval}, series = {SIGIR '18}, year = {2018}, pages = {1365-1368}, publisher = {ACM}, address = {New York, NY, USA}, location = {Ann Arbor, MI, USA}, }
	David C. Anastasiu, Evangelia Christakopoulou, Shaden Smith, Mohit Sharma, George Karypis. Big Data and Recommender Systems. Novática: Journal of the Spanish Computer Scientist Association, 2016. @article{anastasiu-anastasiu2016c, author = {David C. Anastasiu and Evangelia Christakopoulou and Shaden Smith and Mohit Sharma and George Karypis}, title = {Big Data and Recommender Systems}, journal = {Novática: Journal of the Spanish Computer Scientist Association}, number = {240}, month = {October}, year = {2016}, }
	Jinlong Hu, Junjie Liang, Yuezhen Kuang, Vasant Honavar. A user similarity-based Top-N recommendation approach for mobile in-application advertising. Expert Systems with Applications, 111:51-60, 2018. Big Data Analytics for Business Intelligence @article{hu-hu2018, author = {Jinlong Hu and Junjie Liang and Yuezhen Kuang and Vasant Honavar}, title = {A user similarity-based Top-N recommendation approach for mobile in-application advertising}, journal = {Expert Systems with Applications}, volume = {111}, year = {2018}, issn = {0957-4174}, pages = {51-60}, note = {Big Data Analytics for Business Intelligence}, }
	Wout Bittremieux, Pieter Meysman, William Stafford Noble, Kris Laukens. Fast open modification spectral library searching through approximate nearest neighbor indexing. bioRxiv, pages 326173, Cold Spring Harbor Laboratory, 2018. @article{bittremieux-bittremieux2018, author = {Wout Bittremieux and Pieter Meysman and William Stafford Noble and Kris Laukens}, title = {Fast open modification spectral library searching through approximate nearest neighbor indexing}, journal = {bioRxiv}, month = {may}, year = {2018}, pages = {326173}, publisher = {Cold Spring Harbor Laboratory}, }
	Xia Ning, Christian Desrosiers, George Karypis. A Comprehensive Survey of Neighborhood-Based Recommendation Methods. In Recommender Systems Handbook, pages 37-76, Springer US, 2015. @inbook{ning-ning2015survey, author = {Xia Ning and Christian Desrosiers and George Karypis}, title = {A Comprehensive Survey of Neighborhood-Based Recommendation Methods}, booktitle = {Recommender Systems Handbook}, year = {2015}, pages = {37-76}, publisher = {Springer US}, address = {Boston, MA}, }
	Gianmarco De Francisci Morales, Aristides Gionis. Streaming Similarity Self-join. Proc. VLDB Endow., 9(10):792-803, VLDB Endowment, 2016. @article{morales-morales2016, author = {Gianmarco De Francisci Morales and Aristides Gionis}, title = {Streaming Similarity Self-join}, journal = {Proc. VLDB Endow.}, issue_date = {June 2016}, volume = {9}, number = {10}, year = {2016}, issn = {2150-8097}, pages = {792-803}, publisher = {VLDB Endowment}, }
	Haoyu Zhang, Qin Zhang. EmbedJoin: Efficient Edit Similarity Joins via Embeddings. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), pages 585-594, ACM, 2017. @inproceedings{zhang-zhang2017, author = {Haoyu Zhang and Qin Zhang}, title = {EmbedJoin: Efficient Edit Similarity Joins via Embeddings}, booktitle = {Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, series = {KDD '17}, year = {2017}, pages = {585-594}, publisher = {ACM}, address = {New York, NY, USA}, location = {Halifax, NS, Canada}, }
	Hsiang Fu Yu, Cho Jui Hsieh, Qi Lei, Inderjit S. Dhillon. A Greedy Approach for Budgeted Maximum Inner Product Search. In Advances in Neural Information Processing Systems 30 (NIPS'17), pages 5459-5468, 2017. @inproceedings{yu-yu2017, author = {Hsiang Fu Yu and Cho Jui Hsieh and Qi Lei and Inderjit S. Dhillon}, title = {A Greedy Approach for Budgeted Maximum Inner Product Search}, booktitle = {Advances in Neural Information Processing Systems 30}, series = {NIPS'17}, year = {2017}, pages = {5459-5468}, }
	Roberto Interdonato, Andrea Tagarelli, Dino Ienco, Arnaud Sallaberry, Pascal Poncelet. Local community detection in multilayer networks. Data Min. Knowl. Discov., 31(5):1444-1479, 2017. @article{interdonato-interdonato2017, author = {Roberto Interdonato and Andrea Tagarelli and Dino Ienco and Arnaud Sallaberry and Pascal Poncelet}, title = {Local community detection in multilayer networks}, journal = {Data Min. Knowl. Discov.}, volume = {31}, number = {5}, year = {2017}, pages = {1444-1479}, }
	Jianbin Qin, Chuan Xiao. Pigeonring: A Principle for Faster Thresholded Similarity Search. Proc. VLDB Endow., 12(1):28-42, VLDB Endowment, 2018. @article{qin-qin2018, author = {Jianbin Qin and Chuan Xiao}, title = {Pigeonring: A Principle for Faster Thresholded Similarity Search}, journal = {Proc. VLDB Endow.}, issue_date = {September 2018}, volume = {12}, number = {1}, year = {2018}, issn = {2150-8097}, pages = {28-42}, publisher = {VLDB Endowment}, }
	Helen N. Catanese, Kelly A. Brayton, Assefaw H. Gebremedhin. A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen. BMC Bioinformatics, 19(1):475, 2018. @article{catanese-catanese2018, author = {Helen N. Catanese and Kelly A. Brayton and Assefaw H. Gebremedhin}, title = {A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen}, journal = {BMC Bioinformatics}, volume = {19}, number = {1}, month = {Dec}, year = {2018}, issn = {1471-2105}, pages = {475}, }
	Yukihiro Tagami. AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), pages 455-464, ACM, 2017. @inproceedings{tagami-tagami2017, author = {Yukihiro Tagami}, title = {AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification}, booktitle = {Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, series = {KDD '17}, year = {2017}, pages = {455-464}, publisher = {ACM}, address = {New York, NY, USA}, location = {Halifax, NS, Canada}, }
	David C. Anastasiu, George Karypis. L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds. In 30th IEEE International Conference on Data Engineering (ICDE '14), 2014. @inproceedings{anastasiu-anastasiu2014, author = {David C. Anastasiu and George Karypis}, title = {L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds}, booktitle = {30th IEEE International Conference on Data Engineering}, series = {ICDE '14}, year = {2014}, location = {Chicago, IL, USA}, }
	David C. Anastasiu, George Karypis. L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning. In 24th ACM International Conference on Information and Knowledge Management (CIKM '15), 2015. @inproceedings{anastasiu-anastasiu2015, author = {David C. Anastasiu and George Karypis}, title = {L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning}, booktitle = {24th ACM International Conference on Information and Knowledge Management}, series = {CIKM '15}, year = {2015}, location = {Melbourne, Australia}, }
	David C. Anastasiu, George Karypis. Parallel cosine nearest neighbor graph construction. Journal of Parallel and Distributed Computing, 2017. @article{anastasiu-anastasiu-jpdc2017, author = {David C. Anastasiu and George Karypis}, title = {Parallel cosine nearest neighbor graph construction}, journal = {Journal of Parallel and Distributed Computing}, year = {2017}, issn = {0743-7315}, }
	David C. Anastasiu, George Karypis. Efficient identification of Tanimoto nearest neighbors; All Pairs Similarity Search Using the Extended Jaccard Coefficient. International Journal of Data Science and Analytics, 4(3):153-172, 2017. @article{anastasiu-anastasiu-jdsa2017, author = {David C. Anastasiu and George Karypis}, title = {Efficient identification of Tanimoto nearest neighbors; All Pairs Similarity Search Using the Extended Jaccard Coefficient}, journal = {International Journal of Data Science and Analytics}, volume = {4}, number = {3}, month = {Nov}, year = {2017}, issn = {2364-4168}, pages = {153-172}, }
	Antonio Cali\`o, Roberto Interdonato, Chiara Pulice, Andrea Tagarelli. Topology-Driven Diversity for Targeted Influence Maximization with Application to User Engagement in Social Networks. IEEE Trans. Knowl. Data Eng., 30(12):2421-2434, 2018. @article{cali\`o-calio2018, author = {Antonio Cali\`o and Roberto Interdonato and Chiara Pulice and Andrea Tagarelli}, title = {Topology-Driven Diversity for Targeted Influence Maximization with Application to User Engagement in Social Networks}, journal = {IEEE Trans. Knowl. Data Eng.}, volume = {30}, number = {12}, year = {2018}, pages = {2421-2434}, }
	Diego Perna, Andrea Tagarelli. Learning to Rank Social Bots. In Proceedings of the 29th on Hypertext and Social Media, HT 2018, Baltimore, MD, USA, July 09-12, 2018, pages 183-191, 2018. @inproceedings{perna-perna2018, author = {Diego Perna and Andrea Tagarelli}, title = {Learning to Rank Social Bots}, booktitle = {Proceedings of the 29th on Hypertext and Social Media, HT 2018, Baltimore, MD, USA, July 09-12, 2018}, year = {2018}, pages = {183-191}, }
	Domenico Mandaglio, Alessia Amelio, Andrea Tagarelli. Consensus Community Detection in Multilayer Networks Using Parameter-Free Graph Pruning. In Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part III, pages 193-205, 2018. @inproceedings{mandaglio-mandaglio2018, author = {Domenico Mandaglio and Alessia Amelio and Andrea Tagarelli}, title = {Consensus Community Detection in Multilayer Networks Using Parameter-Free Graph Pruning}, booktitle = {Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part III}, year = {2018}, pages = {193-205}, }
	Stanislav Morozov, Artem Babenko. Non-metric Similarity Graphs for Maximum Inner Product Search. In Advances in Neural Information Processing Systems 32 (NIPS'18), 2018. @inproceedings{morozov-Morozov2018NonmetricSG, author = {Stanislav Morozov and Artem Babenko}, title = {Non-metric Similarity Graphs for Maximum Inner Product Search}, booktitle = {Advances in Neural Information Processing Systems 32}, series = {NIPS'18}, year = {2018}, }
	Andrea Tagarelli, Alessia Amelio, Francesco Gullo. Ensemble-based community detection in multilayer networks. Data Min. Knowl. Discov., 31(5):1506-1543, 2017. @article{tagarelli-tagarelli2017, author = {Andrea Tagarelli and Alessia Amelio and Francesco Gullo}, title = {Ensemble-based community detection in multilayer networks}, journal = {Data Min. Knowl. Discov.}, volume = {31}, number = {5}, year = {2017}, pages = {1506-1543}, }
	Y. Li, J. Fan, Y. Wang, K. Tan. Influence Maximization on Social Graphs: A Survey. IEEE Transactions on Knowledge and Data Engineering, 30(10):1852-1872, 2018. @article{li-li2018influence, author = {Y. Li and J. Fan and Y. Wang and K. Tan}, title = {Influence Maximization on Social Graphs: A Survey}, journal = {IEEE Transactions on Knowledge and Data Engineering}, volume = {30}, number = {10}, month = {Oct}, year = {2018}, issn = {1041-4347}, pages = {1852-1872}, }
	Y. Cao, H. Qi, W. Zhou, J. Kato, K. Li, X. Liu, J. Gui. Binary Hashing for Approximate Nearest Neighbor Search on Big Data: A Survey. IEEE Access, 6:2039-2054, 2018. @article{cao-cao2018, author = {Y. Cao and H. Qi and W. Zhou and J. Kato and K. Li and X. Liu and J. Gui}, title = {Binary Hashing for Approximate Nearest Neighbor Search on Big Data: A Survey}, journal = {IEEE Access}, volume = {6}, year = {2018}, issn = {2169-3536}, pages = {2039-2054}, }
	Yuliang Li, Jianguo Wang, Benjamin Pullman, Nuno Bandeira, Yannis Papakonstantinou. Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees. In 22nd International Conference on Database Theory (ICDT), March 26-29, 2019. @inproceedings{li-Li2019, author = {Yuliang Li and Jianguo Wang and Benjamin Pullman and Nuno Bandeira and Yannis Papakonstantinou}, title = {Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees}, booktitle = {22nd International Conference on Database Theory (ICDT), March 26-29}, year = {2019}, date = {2019-03-13}, }
	Zeehasham Rasheed, Huzefa Rangwala. A map-reduce framework for clustering metagenomes. In 2013 IEEE International Symposium on Parallel \& Distributed Processing, Workshops and Phd Forum, pages 549-558, 2013. @inproceedings{rasheed-rasheed2013map, author = {Zeehasham Rasheed and Huzefa Rangwala}, title = {A map-reduce framework for clustering metagenomes}, booktitle = {2013 IEEE International Symposium on Parallel \& Distributed Processing, Workshops and Phd Forum}, year = {2013}, pages = {549-558}, }
	Huzefa Rangwala, Anveshi Charuvaka, Zeehasham Rasheed. Machine learning approaches for metagenomics. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 512-515, 2014. @inproceedings{rangwala-rangwala2014machine, author = {Huzefa Rangwala and Anveshi Charuvaka and Zeehasham Rasheed}, title = {Machine learning approaches for metagenomics}, booktitle = {Joint European Conference on Machine Learning and Knowledge Discovery in Databases}, year = {2014}, pages = {512-515}, }
	Mohammad Arifur Rahman, Nathan LaPierre, Huzefa Rangwala. Phenotype Prediction from Metagenomic Data Using Clustering and Assembly with Multiple Instance Learning (CAMIL). IEEE/ACM transactions on computational biology and bioinformatics, IEEE, 2017. @article{rahman-rahman2017phenotype, author = {Mohammad Arifur Rahman and Nathan LaPierre and Huzefa Rangwala}, title = {Phenotype Prediction from Metagenomic Data Using Clustering and Assembly with Multiple Instance Learning (CAMIL)}, journal = {IEEE/ACM transactions on computational biology and bioinformatics}, year = {2017}, publisher = {IEEE}, }
	Mohammad Arifur Rahman, Nathan LaPierre, Huzefa Rangwala, Daniel Barbara. Metagenome sequence clustering with hash-based canopies. Journal of bioinformatics and computational biology, 15(06):1740006, World Scientific Publishing Company, 2017. @article{rahman-rahman2017metagenome, author = {Mohammad Arifur Rahman and Nathan LaPierre and Huzefa Rangwala and Daniel Barbara}, title = {Metagenome sequence clustering with hash-based canopies}, journal = {Journal of bioinformatics and computational biology}, volume = {15}, number = {06}, year = {2017}, pages = {1740006}, publisher = {World Scientific Publishing Company}, }
	Nathan LaPierre, Mohammad Arifur Rahman, Huzefa Rangwala. CAMIL: Clustering and Assembly with Multiple Instance Learning for phenotype prediction. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 33-40, 2016. @inproceedings{lapierre-lapierre2016camil, author = {Nathan LaPierre and Mohammad Arifur Rahman and Huzefa Rangwala}, title = {CAMIL: Clustering and Assembly with Multiple Instance Learning for phenotype prediction}, booktitle = {2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)}, year = {2016}, pages = {33-40}, }

^top

Pictures

Here are some pictures from the tutorial and KDD'19.