Datasets in Our Publications

  1. Xiaohuan Cao, Yuyan Zheng, Chuan Shi, Jingzhi Li, Bin Wu. Meta-path-based link prediction in schema-rich heterogeneous information network. International Journal of Data Science and Analytics. 2017 [pdf]
  2. Dataset: Yago [link]

  3. Jing Zheng, Jian Liu, Chuan Shi, Fuzhen Zhuang, Jingzhi Li, Bin Wu. Recommendation in heterogeneous information network via dual similarity regularization. International Journal of Data Science and Analytics. 2017 [pdf]
  4. Dataset: Douban movie, Douban book, Yelp [download]

    Douban book: The Douban Book dataset comprises 792,062 ratings (on a 1–5 scale) by 13,024 users on 22,347 books.

  5. Jian Liu, Chuan Shi, Binbin Hu, Shenghua Liu, Philip S. Yu. Personalized Ranking Recommendation via Integrating Multiple Feedbacks. PAKDD 2017. [pdf]
  6. Dataset: Douban book, Dianping [download]

    Douban book: The Douban Book dataset contains 190,590 ratings (on a 1-5 scale) involving 12,850 users and 22,040 books.

    Dianping: The Dianping dataset contains 188,813 ratings (on a 1-5 scale) involving 10,549 users and 17,707 restaurants.

  7. Yuyan Zheng, Chuan Shi, Xiaohuan Cao, Xiaoli Li, Bin Wu. Entity Set Expansion with Meta Path in Knowledge Graph. PAKDD 2017. [pdf]
  8. Dataset: Yago [link]

    Yago: Yago is a huge semantic knowledge graph derived from Wikipedia, WordNet and GeoNames. Currently, it has knowledge about more than 10 million entities and contains more than 120 million facts. We use the "yagoFacts", "yagoSimpleTypes" and "yagoTaxonomy" parts of this dataset in our experiments; together they contain 35 relationships and more than 1.3 million entities from 3,455 instance classes.

  9. Chuan Shi, Yitong Li, Philip S. Yu, Bin Wu. Constrained-Meta-Path based Ranking in Heterogeneous Information Network. Knowledge and Information Systems, 49(2), 719-747, 2016. [pdf]
  10. Dataset: DBLP, ACM, IMDB [link]

  11. Chuan Shi, Jian Liu, Fuzhen Zhuang, Philip S. Yu, Bin Wu. Integrating Heterogeneous Information via Flexible Regularization Framework for Recommendation. Knowledge and Information Systems, 2016. [pdf]
  12. Dataset: Douban movie, MovieLens, Yelp challenge, Douban book

  13. Yitong Li, Chuan Shi, Huidong Zhao, Fuzhen Zhuang, and Bin Wu. Aspect Mining with Rating Bias. ECML2016. [pdf]
  14. Dataset: TripAdvisor, Dianping [download]

    TripAdvisor: The TripAdvisor dataset contains 162,595 ratings by 79,013 users on 5,530 hotels.

    Dianping: The Dianping dataset contains 216,291 ratings by 14,022 users on 1,097 restaurants.

  15. Xiaohuan Cao, Yuyan Zheng, Chuan Shi, Bin Wu. Link Prediction in Schema-Rich Heterogeneous Information Network. PAKDD2016. [pdf]
  16. Dataset: Yago [link]

    Yago: Yago is a large-scale knowledge graph derived from Wikipedia, WordNet and GeoNames. The dataset includes more than ten million entities and 120 million facts about these entities. We adopt only the "COREFact" part of this dataset, which contains 4,484,914 facts, 35 relationships and 1,369,931 entities of 3,455 types.

  17. Jing Zheng, Jian Liu, Chuan Shi, Fuzhen Zhuang, Bin Wu. Dual Similarity Regularization for Recommendation. PAKDD2016. [pdf]
  18. Dataset: Douban movie, Yelp [download]

    Douban movie: Douban is a well-known social media network in China. The dataset includes 3,022 users and 6,971 movies with 195,493 ratings ranging from 1 to 5.

    Yelp: Yelp is a well-known user review website in the United States. The dataset includes 14,085 users and 14,037 businesses with 194,255 ratings ranging from 1 to 5.

  19. Chuan Shi, Zhiqiang Zhang, Ping Luo, Philip S. Yu, Yading Yue, Bin Wu. Semantic Path based Personalized Recommendation on Weighted Heterogeneous Information Networks. CIKM2015: 453-462. [pdf] [code]
  20. Dataset: Douban movie, Yelp [download]

    Douban movie: The Douban dataset includes 13,367 users and 12,677 movies with 1,068,278 movie ratings ranging from 1 to 5.

    Yelp: The Yelp dataset contains user ratings on local businesses and attribute information of users and businesses. The dataset includes 16,239 users and 14,284 local businesses with 198,397 ratings from 1 to 5.

  21. Chuan Shi, Xiangnan Kong, Yue Huang, Philip S. Yu, Bin Wu. HeteSim: A General Framework for Relevance Measure in Heterogeneous Networks. IEEE Transactions on Knowledge and Data Engineering, 26(10): 2479-2492, 2014. [pdf] [code]
  22. Dataset: ACM, DBLP [download]

    ACM: The data set has 12K papers, 17K authors, and 1.8K author affiliations.

    DBLP: The dataset contains 14K papers, 20 conferences, 14K authors and 8.9K terms, with a total number of 17K links. In the data set, 4,057 authors, all 20 conferences and 100 papers are labeled with one of the four research areas.
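
    The rating datasets above differ widely in how sparse they are. As a rough illustration, the sketch below computes the rating-matrix density (ratings divided by users times items) from the counts quoted in the descriptions. The function name, the dictionary, and the labels are hypothetical helpers introduced here, not part of any released code, and the density definition is the usual one rather than something stated in the papers.

```python
# Rating-matrix density = ratings / (users * items).
# All counts are copied from the dataset descriptions in this list; the
# labels are illustrative and only tell the dataset variants apart.


def rating_density(num_ratings: int, num_users: int, num_items: int) -> float:
    """Return the fraction of user-item pairs that have an observed rating."""
    if num_users <= 0 or num_items <= 0:
        raise ValueError("user and item counts must be positive")
    return num_ratings / (num_users * num_items)


# label -> (ratings, users, items)
DATASETS = {
    "Douban book (Zheng et al., 2017)": (792_062, 13_024, 22_347),
    "Douban book (Liu et al., PAKDD 2017)": (190_590, 12_850, 22_040),
    "Dianping (Liu et al., PAKDD 2017)": (188_813, 10_549, 17_707),
    "TripAdvisor (Li et al., ECML 2016)": (162_595, 79_013, 5_530),
    "Dianping (Li et al., ECML 2016)": (216_291, 14_022, 1_097),
    "Douban movie (Zheng et al., PAKDD 2016)": (195_493, 3_022, 6_971),
    "Yelp (Zheng et al., PAKDD 2016)": (194_255, 14_085, 14_037),
    "Douban movie (Shi et al., CIKM 2015)": (1_068_278, 13_367, 12_677),
    "Yelp (Shi et al., CIKM 2015)": (198_397, 16_239, 14_284),
}

densities = {name: rating_density(*counts) for name, counts in DATASETS.items()}

# Print from sparsest to densest.
for name, density in sorted(densities.items(), key=lambda kv: kv[1]):
    print(f"{name:<42} {density:.4%}")
```

    Every dataset fills well under 2% of its user-item matrix, which is worth keeping in mind when comparing results across these datasets.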

Datasets from the Internet

  1. ASU Social Computing Data Repository
  2. Stanford Large Network Dataset Collection
  3. AMiner Dataset
  4. Text REtrieval Conference (TREC) Data
  5. CiteSeerX Data
  6. Jazz musicians network dataset
  7. Tianchi Data Lab
  8. Database and Information System (DBIS) Data: The DBIS dataset was constructed and used by Sun et al. It covers 464 venues, their top-5000 authors, and corresponding 72,902 publications. Citation: Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, and Tianyi Wu. 2011. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. In VLDB'11. 992–1003.
  9. Zhihu Dataset: Zhihu content and user social network information, crawled with Scrapy. It currently contains 314,400 questions and 261,376 users.
  10. Meetup Dataset: Each file contains information about the participation patterns of Meetup users in group events for a particular Meetup group. Data are fully anonymized.
  11. Netflix Prize Dataset: It is the official data set used in the Netflix Prize competition. The data consists of about 100 million movie ratings, and the goal is to predict missing entries in the movie-user rating matrix.
  12. MovieLens: GroupLens Research has collected and made available rating data sets from the MovieLens website.
  13. Citation Network Dataset: The dataset is designed for research purposes only. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with abstract, authors, year, venue, and title.
  14. IMDB: This is a link dataset built with permission from the Internet Movie Database (IMDB).
  15. Yelp: The Yelp dataset is a subset of businesses, reviews, and user data for personal, educational, and academic purposes.
  16. 20 Newsgroups: The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
  17. YAGO: YAGO is a huge semantic knowledge base, derived from Wikipedia, WordNet and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.