
Data Science Links

dendrogram, reachability graph, OPTICS algorithm


  • The Elements of Statistical Learning link
  • Information Theory, Inference, and Learning Algorithms link
  • Gaussian Processes for Machine Learning link
  • Data-Intensive Text Processing with MapReduce link
  • Mining of Massive Datasets link
  • Reinforcement Learning: An Introduction link
  • Bayesian Reasoning and Machine Learning link
  • Convex Optimization link
  • Introduction to Information Retrieval link
  • Graphical Models, Exponential Families, and Variational Inference link
  • Multiagent Systems link
  • Machine Learning, Neural and Statistical Classification link
  • Deep Learning link


  1. RapidMiner Data Mining Tool link
  2. Weka Data Mining Tool link
  3. KNIME Data Mining Tool link
  4. R Statistical Computing Tool link
  5. Spider Machine Learning Tool in Matlab link
  6. AlphaMiner link
  7. Rattle Data Mining Tool (based on R)  link
  8. Data Analysis Toolbox link
  9. Shogun - Large Scale Machine Learning Toolbox link


C/C++ Source Code

This page links to C/C++ source code for data mining and machine learning algorithms.

  • MLC ++ (Collection of Supervised Algorithms) link
  • Neural Optimization Development Engine link
  • Torch (Machine Learning Toolbox) link
  • VFML  (Mining High-speed Data Streams) link
  • Flexible Bayesian Modeling and Markov Chain Sampling link
  • Apriori link
  • Multivariate Polynomial Regression link
  • Decision and Regression Tree Induction link
  • Naive Possibilistic Classifier Induction link
  • Induction of Graphical Network Structures link
  • Multilayer Perceptron link

Java Source Code

This page links to Java source code for data mining and machine learning algorithms.

  • MALLET Machine Learning Toolbox link
  • JavaML link
  • MLJ (Java port and extension of MLC++) link

Matlab Source Code 

This page links to Matlab source code for data mining and machine learning algorithms.

  • Netlab Neural Network Software link
  • Bayes Net Toolbox  link
  • Gibbs sampling for hierarchical Bayesian models link
  • MCMC link
  • Flexible Bayesian Modeling and Markov Chain Sampling link

Dataset Categories

Annotated ClueWeb corpus
Wikilinks Corpus: 40 million disambiguated mentions across more than 10 million web pages, over 100 times bigger than the next largest corpus (about 100,000 documents; see the table below for mention and entity counts). Mentions are found by looking for links to Wikipedia pages where the anchor text closely matches the title of the target Wikipedia page.
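
The mention-finding rule described above can be sketched roughly as follows. This is only an illustration, not the pipeline used to build the Wikilinks Corpus: the similarity measure (difflib's ratio), the 0.8 threshold, and the names AnchorCollector, wikipedia_title, and find_mentions are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def wikipedia_title(href):
    """Return the page title encoded in a Wikipedia URL, or None."""
    parsed = urlparse(href)
    host = parsed.netloc.lower()
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    if not is_wikipedia or not parsed.path.startswith("/wiki/"):
        return None
    return unquote(parsed.path[len("/wiki/"):]).replace("_", " ")


def find_mentions(html, threshold=0.8):
    """Return (anchor text, page title) pairs whose strings closely match.

    The threshold is a hypothetical cutoff chosen for this sketch.
    """
    parser = AnchorCollector()
    parser.feed(html)
    mentions = []
    for href, text in parser.anchors:
        title = wikipedia_title(href)
        if title is None or not text:
            continue
        score = SequenceMatcher(None, text.lower(), title.lower()).ratio()
        if score >= threshold:
            mentions.append((text, title))
    return mentions
```

Calling find_mentions on a page's HTML keeps a link such as "Barack Obama" pointing at /wiki/Barack_Obama and skips generic anchors such as "click here", since their text does not resemble the target title.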

  • General Data Repositories and Benchmarks here
    • UCI Repository link
    • UCI KDD Repository (large datasets) link
    • WIKIPOSIT Dataset List link
    • Public Data Sets on Amazon Web Services (large datasets) link
    • Delve Repository (Classification, Regression) link
    • Infochimps Open Catalog link
    • Kevin Chai Dataset Catalog link
    • STATOO Dataset List link1 link2
    • Digging Into Data Repository link
    • Clustering Datasets by Koln University  link
    • Clustering Datasets link
    • Frequent Itemset Mining link


  1. Gunnar Raetsch's Benchmark Datasets (Classification) link
  2. Meyer Benchmark Datasets (Classification and Regression) link
  3. Fundamental Clustering Problem Suite (Clustering) link


  1. Time Series Repository by Eamonn Keogh link
  2. Economic Time Series link
  3. Time Series Data Library by Rob Hyndman link
  • Web and Social Media here


  1. CMU WWW Knowledgebase link
  2. Microsoft Learning to Rank Repository (LETOR) link
  3. Entire Wikipedia link
  4. 44 Million Blog Posts by ICWSM 2009 link
  5. Research data link
  6. Computer Science Department Pages link
  7. Google Flu Trends link
  8. Database of Several Million Human Feelings link
  9. BBC, Digg, MySpace Sentiments link
  10. Linked Data link


  1. WordSimilarity-353 Test Collection link
  2. TREC Repository (Text) link
  3. TechTC Repository (Text) link
  4. Information Extraction Repository link
  5. Reuters-21578 (Reuters News Corpus) link
  6. OntoNotes Project (Various Text Corpus) link
  7. PennBioIE (Medical Text Corpus) link
  8. Enron (Email Corpus) link
  9. Andrew McCallum Dataset Collection link
  10. Google Books Ngram Datasets link


  1. PubGene link
  2. GenReg (Gene Regulation Corpus) link
  3. Gene Expression Omnibus (GEO) link
  4. Protein Data Bank (PDB) link
  5. Homo Sapiens Splice Sites Dataset link
  6. Cancer Program Datasets link
  7. Stanford Microarray Dataset link


  1. Berkeley Benchmark datasets of D. melanogaster DNA sequences link
  2. Infobiotics PSP benchmarks repository link


  1. Face Recognition link
  2. Yale Face Dataset link
  3. Path Finding in Images (Field Robotics) link
  4. Microsoft Aerial Photographs and Satellite Images link
  5. Photo Tourism link


  1. TRECVID (Video Analysis) link


  1. NIST Biometric Datasets (Fingerprints, Faces, etc) link
  • Smart Environment, Ambient Intelligence here


  1. Washington State University Smart Home Repository (CASAS) link
  2. MIT Activity Dataset link
  3. MIT PlaceLab link
  4. University of Tokyo ISIS link
  5. University of Amsterdam CARE link
  6. CMU Multi-Modal Activity Database (CMU-MMAC) link
  7. Edinburgh Pedestrian Database link
  8. University of Florida Gator Tech Smart House link
  9. Intel Berkeley Research Lab link


  1. Wearable Action Recognition Database (WARD) link
  • Space, Environment, Music, etc here


  1. Space Science Data link
  2. FreeDB Music Database here


  1. KDD Cup link
  2. $3M Heritage Health Data Analysis Prize link


  1. UK Government Publicly Available Datasets link
  2. San Francisco Publicly Available Datasets link
  3. A collection of US state and Federal datasets link
  4. Comprehensive US Statistics link
  5. National Government Statistics link
  6. US Census link
  7. University of Washington Various Datasets link


  1. Financial Data Collection link
  2. AMEX, NYSE, NASDAQ Stocks (Not Free) link
  3. NASDAQ Data Store link
  • Statistics Community here


  1. Function Approximation Repository link
  2. Statistical Reference Datasets link
  3. StatLib archive link


  1. Wiki API link
  2. Twitter API link
  3. BestBuy API link
  4. Reddit API link
  5. Flickr API link
  6. Netflix API link
  7. Linked Movie Database API link
  8. Thesaurus API link
  9. Yahoo Music API link
  10. Google Trends API link
  11. Reuters Spotlight API link
  12. BackType API (blogs, etc) link
  13. Kiva API (Micro fund Organization) link
  14. BART API link
  15. Trading Solutions API
  16. Campaign Finance API link