We have collected a dataset of 116 point-clouds of objects with 249 object parts (examples shown below), along with a total of 250 natural language instructions organized into 155 manuals. Using the crowdsourcing platform Robobarista, we collected 1225 trajectories for these objects from 71 non-expert users on Amazon Mechanical Turk. Each user is first shown a 20-second instructional video and then completes a 2-minute tutorial task. In each session, the user is asked to complete 10 assignments, each consisting of an object and a manual to be followed. For each object, we took raw RGB-D images with the Microsoft Kinect sensor and stitched them using Kinect Fusion to form a denser point-cloud that incorporates different viewpoints of the object. Objects range from kitchen appliances such as ‘stove’, ‘toaster’, and ‘rice cooker’ to ‘urinal’, ‘soap dispenser’, and ‘sink’ in restrooms. Although not necessary for training our model, we also collected trajectories from an expert for evaluation purposes.
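The organization of the crowdsourced data described above can be sketched as a simple record type. This is a hypothetical in-memory representation for illustration only; the `Assignment` class and its field names are our assumptions, not the dataset's actual storage format.

```python
from dataclasses import dataclass, field

@dataclass
class Assignment:
    """One crowdsourcing assignment: an object paired with a manual.

    A session gives each user 10 such assignments; demonstrated
    trajectories from non-expert users accumulate per assignment.
    """
    object_name: str                       # e.g. 'toaster'
    part_ids: list                         # object parts the manual refers to
    manual_steps: list                     # natural language instructions
    trajectories: list = field(default_factory=list)  # user demonstrations

def make_session(assignments):
    """Group assignments into one session of exactly 10, as in the study."""
    if len(assignments) != 10:
        raise ValueError("a session consists of 10 assignments")
    return list(assignments)
```

For example, `Assignment('toaster', ['lever'], ['Push down the lever.'])` would represent a single object–manual pair before any trajectories have been collected for it.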