Tiny Images Dataset

Rob Fergus [1]   Antonio Torralba [2]   William T. Freeman [2]

[1] Dept. of Computer Science, Courant Institute, New York University

[2] Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology



Overview

This page has links for downloading the Tiny Images dataset, which consists of 79,302,017 32x32 color images. The data is stored in large binary files which can be accessed with a Matlab toolbox that we have written. You will need around 400Gb of free disk space to store all the files. In total there are 5 files to download. Three are large binary files holding (i) the images themselves; (ii) their associated metadata (filename, search engine used, ranking etc.); (iii) Gist descriptors for each image. The other two are the Matlab toolbox and an index data file that together let you easily load data from the binaries.

Downloads
Note that these files are very large and will take a considerable time to download. Please ensure you have sufficient disk space before commencing the download.

  1. Image binary (227Gb)   Download

  2. Metadata binary (57Gb)  Download

  3. Gist binary (114Gb)  Download

  4. Index data (7Mb)  Download

  5. Matlab Tiny Images toolbox (150Kb)  Download
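
As a rough consistency check on the sizes listed above, the size of the image binary can be reproduced from the image count. This is only a sketch: it assumes (the page does not say so) that each image is stored as 32*32*3 unsigned bytes with no file header or padding.

```matlab
% Sketch: expected size of the image binary.
% ASSUMPTION: each image occupies 32*32*3 uint8 values and the file has no header.
num_images      = 79302017;
bytes_per_image = 32 * 32 * 3;                  % 3072 bytes per image
total_bytes     = num_images * bytes_per_image;
fprintf('Expected size: %.1f (in units of 2^30 bytes)\n', total_bytes / 2^30);
```

Under that assumption the result is roughly 227, which agrees with the 227Gb listed for the image binary and suggests the sizes on this page are in units of 2^30 bytes.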

Instructions
Overview
--------
The 79 million images are stored in one giant binary file, 227Gb in size. The metadata accompanying each image is also in a single giant file, 57Gb in size. To read images/metadata from these files, we have provided some Matlab wrapper functions.

There are two versions of the functions for reading image data:
(i) loadTinyImages.m - plain Matlab function (no MEX), runs under 32/64bits. Loads images in by image number. Use this by default.
(ii) read_tiny_big_binary.m - Matlab wrapper for 64-bit MEX function. A bit faster and more flexible than (i), but requires a 64-bit machine.

There are two types of annotation data:
(i) Manual annotation data, stored in annotations.txt, which holds the labels of images that were manually inspected to see whether the image content agrees with the noun used to collect it. Some other information, such as the search engine, is also stored. This data is available for only a very small portion of the images.
(ii) Automatic annotation data, stored in tiny_metadata.bin, consisting of information relating to the gathering of the image, e.g. search engine, which page, url to thumbnail etc. This data is available for all 79 million images.

Requirements
------------
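1. Around 300Gb of disk space for the image and metadata binaries (227Gb + 57Gb), or around 400Gb if you also download the Gist binary (114Gb).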

2. If you want to use the MEX versions of the code for reading in the data, you will need a 64-bit machine. For most purposes, though, the pure Matlab implementation (loadTinyImages.m), which runs on either 32-bit or 64-bit machines, will work perfectly well. To find out whether your machine is 32-bit or 64-bit on Linux, type 'uname -a' in a terminal.
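
The same question can also be answered from the Matlab prompt. The sketch below uses the built-in computer and mexext functions; the substring test on '64' is our own heuristic and not part of the toolbox. Note that it reports the architecture of the running Matlab, which is what matters for MEX files.

```matlab
% Sketch: check whether the running Matlab is a 64-bit build.
arch = computer('arch');                      % e.g. 'glnxa64' on 64-bit Linux
is_64bit = ~isempty(strfind(arch, '64'));     % heuristic: 64-bit arch names contain '64'
if is_64bit
    fprintf('64-bit Matlab (%s); MEX files will have extension .%s\n', arch, mexext);
else
    fprintf('32-bit Matlab (%s); use loadTinyImages.m\n', arch);
end
```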

Files
-----

The .tgz file should contain 12 files

1. loadTinyImages.m -- read tiny image data, pure Matlab version.
2. loadGroundTruth.m -- read annotations.txt file holding manual annotations
3. read_tiny_big_binary.m -- read tiny image data, 64-bit Matlab/MEX version
4. read_tiny_big_metadata.m -- read tiny image metadata, 64-bit Matlab/MEX version
5. read_tiny_gist_binary.m -- read tiny Gist, 64-bit Matlab/MEX version
6. read_tiny_binary_big_core.c -- 64-bit MEX source code for image reading
7. read_tiny_metadata_big_core.c -- 64-bit MEX source code for metadata reading
8. read_tiny_binary_gist_core.c -- 64-bit MEX source code for gist reading
9. compute_hash_function.m -- utility function to do fast string searching as used by read_tiny_big_binary.m and read_tiny_big_metadata.m
10. fast_str2num.m -- utility function for read_tiny_big_metadata.m
11. annotations.txt -- text file holding list of annotated images
12. README.txt -- this file

Separately, you should have downloaded the following files

1. tiny_images.bin - 227Gb file holding 79,302,017 images
2. tiny_metadata.bin - 57Gb file holding metadata for all 79,302,017 images
3. tinygist80million.bin - 114Gb file holding 384-dim Gist descriptors for all 79,302,017 images
4. tiny_index.mat - 7Mb file holding index info, including:
        word - cell array of all 75,846 nouns for which we have images in tiny_images.bin
        num_imgs - vector with #images per noun for all 75,846 nouns
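
To illustrate how the index file can be used on its own, the following sketch looks up how many images are available for a given noun. It assumes tiny_index.mat is in the current directory (adjust the path otherwise) and that word and num_imgs are aligned element by element, as the description above implies.

```matlab
% Sketch: query the index for the number of images of one noun.
% ASSUMPTION: tiny_index.mat is in the current directory.
S = load('tiny_index.mat', 'word', 'num_imgs');
noun = 'dog';
k = find(strcmp(S.word, noun), 1);            % position of the noun in the word list
if isempty(k)
    fprintf('No images for noun "%s"\n', noun);
else
    fprintf('%d images for noun "%s"\n', double(S.num_imgs(k)), noun);
end
fprintf('%d nouns, %d images in total\n', numel(S.word), sum(double(S.num_imgs)));
```

The per-noun count is the useful upper bound when requesting images by noun and index.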

Preliminaries
-------------
Before the functions can be used you must do two things:

1. Set the absolute paths to the binary files in the Matlab functions. There are a total of 7 lines that must be set:

(i) loadTinyImages.m, line 14 -- set path to tiny_images.bin file
(ii) read_tiny_big_binary.m, line 40 -- set path to tiny_images.bin file
(iii) read_tiny_big_binary.m, line 42 -- set path to tiny_index.mat file
(iv) read_tiny_big_metadata.m, line 63 -- set path to tiny_metadata.bin file
(v) read_tiny_big_metadata.m, line 65 -- set path to tiny_index.mat file
(vi) read_tiny_gist_binary.m, line 36 -- set path to tiny_index.mat file
(vii) read_tiny_gist_binary.m, line 38 -- set path to tiny_metadata.bin file
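
Before editing the paths, it can help to confirm that every downloaded file is where you expect it and has a plausible size. A minimal sketch; data_dir is a hypothetical location and must be replaced with your own.

```matlab
% Sketch: check that the downloaded files exist and report their sizes.
% ASSUMPTION: data_dir is a placeholder; replace it with your download directory.
data_dir = '/path/to/tiny_images';
files = {'tiny_images.bin', 'tiny_metadata.bin', ...
         'tinygist80million.bin', 'tiny_index.mat'};
for i = 1:numel(files)
    p = fullfile(data_dir, files{i});
    d = dir(p);                               % empty struct if the file is missing
    if isempty(d)
        fprintf('MISSING  %s\n', p);
    else
        fprintf('found    %s (%.2f x 2^30 bytes)\n', p, d.bytes / 2^30);
    end
end
```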

2. If using the MEX versions, they must be compiled with the commands:
(i) mex read_tiny_binary_big_core.c
(ii) mex read_tiny_metadata_big_core.c
(iii) mex read_tiny_binary_gist_core.c
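
The three compile commands above can also be run as a loop that reports failures instead of stopping at the first one. This is a sketch, assuming the current directory is the unpacked toolbox and that a C compiler has already been configured for Matlab (for example with mex -setup).

```matlab
% Sketch: compile all three MEX sources and report any failure.
srcs = {'read_tiny_binary_big_core.c', ...
        'read_tiny_metadata_big_core.c', ...
        'read_tiny_binary_gist_core.c'};
for i = 1:numel(srcs)
    try
        mex(srcs{i});                         % function form of the mex command
        fprintf('compiled %s\n', srcs{i});
    catch err
        fprintf('FAILED   %s: %s\n', srcs{i}, err.message);
    end
end
```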

Usage
-----

Here are some examples of the scripts in use. Please look at the comments at the top of each file for more extensive explanations.

loadTinyImages.m
---------------

% load in first 10 images from 79,302,017 images
img = loadTinyImages([1:10]);

% load in 10 images at random
q = randperm(79302017);
img = loadTinyImages(q(1:10));
%% N.B. function does NOT sort indices, so sorting beforehand would
%% improve speed.
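
Following the note above, one way to get the speed benefit of sorted indices while still receiving the images in the order requested is to sort, load, and then undo the sort. This is a sketch, assuming loadTinyImages returns a 32x32x3xN array (as noted in the read_tiny_big_binary.m section below). It uses randi rather than randperm to avoid allocating a 79-million-element permutation, at the cost of allowing (very unlikely) repeated indices.

```matlab
% Sketch: load random images with sorted disk access, then restore the order.
idx = randi(79302017, 1, 10);                 % 10 random image numbers
[sorted_idx, order] = sort(idx);              % sorted_idx == idx(order)
img_sorted = loadTinyImages(sorted_idx);      % ASSUMPTION: 32x32x3x10 output
img = img_sorted;                             % same size and class as the result
img(:,:,:,order) = img_sorted;                % img(:,:,:,j) is now image idx(j)
```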


loadGroundTruth.m
-----------------

% read in contents of annotations.txt file
[imageFileName, keyword, correct, engine, ind_engine, image_ndx]=loadGroundTruth;
%%% the labeling convention in correct is:
% -1 = Incorrect, 0 = Skipped, 1 = Correct
% Note that this is different from the 'label' field produced by
% read_tiny_big_metadata below (the meanings of -1 and 0 are swapped),
% but the annotations.txt file information should be used in preference to
% that from read_tiny_big_metadata.m
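
As an illustration of how the manual annotations might be combined with the image loader, the sketch below counts the images labelled correct and loads up to 10 of them. It assumes, without confirmation from the toolbox, that correct and image_ndx are numeric vectors of equal length and that image_ndx holds image numbers accepted by loadTinyImages.

```matlab
% Sketch: load some manually verified images.
% ASSUMPTION: correct and image_ndx are numeric vectors of equal length,
% and image_ndx indexes into tiny_images.bin.
[imageFileName, keyword, correct, engine, ind_engine, image_ndx] = loadGroundTruth;
is_correct = (correct == 1);                  % 1 = Correct in annotations.txt
fprintf('%d of %d annotated images are labelled correct\n', ...
        nnz(is_correct), numel(correct));
ndx = sort(image_ndx(is_correct));            % sorted indices load faster
ndx = ndx(1:min(10, numel(ndx)));             % keep at most 10
img = loadTinyImages(ndx);
```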


64-bit MEX versions:
--------------------

read_tiny_big_metadata.m
----------------------

% load in filenames of first 10 images
data = read_tiny_big_metadata([1:10],{'filename'});

% load in search engine used for
% first 10 images from noun 'aardvark';

data = read_tiny_big_metadata('aardvark',[1:10],{'engine'});

read_tiny_big_binary.m
----------------------

% load in first 10 images from 79,302,017 images
img = read_tiny_big_binary([1:10]);
% note output dimension is 3072x10, rather than 32x32x3x10
% as for loadTinyImages.m

% load in first 10 images from noun 'dog';
img = read_tiny_big_binary('dog',[1:10]);

% load in 10 images at random
q = randperm(79302017);
img = read_tiny_big_binary(q(1:10));
% function sorts indices internally for speed

% load in images for different nouns
img = read_tiny_big_binary({'dog','cat','mouse','pig'},{[1:5],[1:2:10],[8 13],[4:-1:1]});
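
Since read_tiny_big_binary returns one 3072-element column per image, the output usually needs reshaping before display. A minimal sketch, assuming each column is a 32x32x3 image laid out in Matlab's column-major order (so that it matches the loadTinyImages output); if the displayed image looks scrambled, that assumption does not hold.

```matlab
% Sketch: convert the 3072xN output into 32x32x3xN and show the first image.
img  = read_tiny_big_binary([1:10]);          % 3072x10
% ASSUMPTION: each column is a 32x32x3 image in column-major order.
img4 = reshape(img, 32, 32, 3, []);           % 32x32x3x10
image(uint8(img4(:,:,:,1)));                  % built-in display, no toolbox needed
axis image off;
```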