Tutorial: A beginner’s guide to crowdsourcing ML training data with Python and MTurk

Published in Happenings at MTurk

12 min read · May 7, 2017

For many machine learning projects one of the best ways to generate training data is to crowdsource it programmatically from Amazon Mechanical Turk (MTurk) using Python. In this guide, we will walk through an end-to-end example of using Python to access MTurk.

Familiarity with MTurk is not required for this guide. In order to follow along, you just need some basic knowledge of Python. This tutorial uses Python 2, but the principles are largely the same in Python 3.

Part 1: Getting Started

First, get your tools and accounts set up as below:

Tools

1. Python — available from: https://www.python.org/downloads/

2. Pip — this should be installed by default if you install Python, but just in case, more installation instructions are available here: https://pip.pypa.io/en/stable/installing/. Pip allows you to install Python applications easily.

3. Virtualenv — this can be installed using pip:

$ pip install virtualenv

You can also get more detailed installation instructions here: https://virtualenv.pypa.io/en/stable/installation/. Virtualenv allows you to easily run Python applications in an isolated environment which reduces problems with version conflicts and permissions.

Once you have Virtualenv installed, create a directory where you want to work, set up and activate a Virtualenv environment inside it:

$ mkdir work
$ cd work
$ virtualenv .
$ source bin/activate

4. Once you have virtualenv activated, install Boto3, the official AWS SDK for Python (we will use this to access MTurk) and xmltodict, a handy Python utility for parsing XML:

$ pip install boto3
$ pip install xmltodict

Accounts
In order to connect to MTurk with Python, you will need an MTurk Requester Account and an AWS account (these are two separate accounts).

Follow the steps below:

1. Sign up for an AWS account at aws.amazon.com

2. Sign up for an MTurk account at requester.mturk.com

3. Go to the developer tab (https://requester.mturk.com/developer) and link your AWS account to your MTurk account (Step 2 on that screen)

MTurk also has a “Sandbox” which is a test version of the MTurk marketplace. You can use it to test publishing and completing tasks without paying any money. To use the Sandbox, you need to sign up for a Sandbox account at requestersandbox.mturk.com. You will then also need to link your AWS account to your Sandbox account from requestersandbox.mturk.com/developer.

Set up an IAM user for MTurk
You will use credentials from your AWS account when making API calls to securely authenticate yourself. The recommended way to do this is to create an “IAM” user following these steps. After you create the IAM user, keep its associated access key and secret key handy for the next step.

Connect to the MTurk Sandbox
The best place to start when writing code with MTurk is to check your account balance in the MTurk Sandbox. This is the “hello world” of MTurk.

1. Go back into the working folder you created earlier and activate your virtualenv settings again by typing in “source bin/activate”

2. Use any text editor to start a new file and type in the following:

import boto3

MTURK_SANDBOX = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com'

mturk = boto3.client('mturk',
   aws_access_key_id = "PASTE_YOUR_IAM_USER_ACCESS_KEY",
   aws_secret_access_key = "PASTE_YOUR_IAM_USER_SECRET_KEY",
   region_name='us-east-1',
   endpoint_url = MTURK_SANDBOX
)

print("I have $" + mturk.get_account_balance()['AvailableBalance'] + " in my Sandbox account")

There are a few things to notice here. The first is that we are creating an MTurk “client” using the Boto3 SDK. We then use the client to make the account balance call to MTurk. You can see a list of all available operations the client can do here.

Secondly, we are passing your IAM user’s access key and secret key directly to the boto3.client() function.

This lets you authenticate your calls to MTurk. However, this is NOT the recommended way to deploy your code in production. The best practice is to store your credentials in a separate file on your local machine, so that they don’t get inadvertently shared with others.

Embedding keys directly is a quick way to test things, but once you have it working, check out our guidelines on how best to manage credentials.

Lastly, the region_name is always ‘us-east-1’ for MTurk.

3. Save the file as “create_tasks.py” in your working folder

4. Run the file from your command line or terminal by typing in “python create_tasks.py”. If all goes well, you will see the following output:

I have $10000.00 in my Sandbox account

In Sandbox, the get_account_balance() call always returns $10,000. In order to connect to the live MTurk marketplace, just leave out the endpoint_url parameter like so:

mturk = boto3.client('mturk',
   aws_access_key_id = "PASTE_YOUR_IAM_USER_ACCESS_KEY",
   aws_secret_access_key = "PASTE_YOUR_IAM_USER_SECRET_KEY",
   region_name='us-east-1'
)
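One way to keep keys out of the script, and to switch between the Sandbox and the live marketplace, is to pass no keys at all. When boto3.client() receives no access keys, Boto3 falls back to its standard credential lookup, which includes the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and the shared ~/.aws/credentials file. The sketch below only builds the keyword arguments for the client. The helper name mturk_client_kwargs and its sandbox flag are illustrative and not part of Boto3.

```python
MTURK_SANDBOX = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com'


def mturk_client_kwargs(sandbox=True):
    """Build keyword arguments for boto3.client('mturk', ...).

    Hypothetical helper: no access keys are included, so Boto3 uses its
    default credential lookup (environment variables or ~/.aws/credentials).
    """
    kwargs = {'region_name': 'us-east-1'}  # always us-east-1 for MTurk
    if sandbox:
        # Only the Sandbox needs an explicit endpoint; omit it for live.
        kwargs['endpoint_url'] = MTURK_SANDBOX
    return kwargs


# Usage (requires boto3 and configured credentials):
#   import boto3
#   mturk = boto3.client('mturk', **mturk_client_kwargs(sandbox=True))
```

Defaulting the flag to the Sandbox means a forgotten argument cannot accidentally publish paid work to the live marketplace.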

Purchasing Prepaid HITs for your account
When working with the Sandbox, you don’t need to worry about purchasing Prepaid HITs for your account. When you are ready to publish tasks to the live marketplace, you need to first buy Prepaid HITs in your account by visiting https://requester.mturk.com/account.

Each time you post a new task, MTurk will draw down your Prepaid HIT balance. When you approve the work submitted by a Worker, the reward is paid to that Worker. Approval can happen automatically, or you can choose to review each submitted task. If you reject a task, the Worker does not get paid and no Amazon fees are collected; instead, the balance is returned to your account.
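Before publishing to the live marketplace, it can help to estimate how much of your prepaid balance a batch will use. The sketch below is rough arithmetic, not an MTurk API. The function names are made up for illustration, and fee_rate is an assumption you must supply yourself after checking MTurk’s current pricing. It takes the balance as a string because that is how get_account_balance() returns it.

```python
from decimal import Decimal


def estimated_cost(reward, num_tasks, workers_per_task, fee_rate):
    """Estimate the total drawn from a prepaid balance for a batch.

    fee_rate is an assumption supplied by the caller (for example '0.20'
    for a 20% fee); check MTurk's current pricing for the real figure.
    """
    rewards = Decimal(reward) * num_tasks * workers_per_task
    total = rewards + rewards * Decimal(fee_rate)
    return total.quantize(Decimal('0.01'))  # round to whole cents


def can_afford(available_balance, cost):
    """Compare the balance string returned by MTurk with an estimated cost."""
    return Decimal(available_balance) >= cost


# 100 tasks, 2 Workers each, $0.15 reward, hypothetical 20% fee
cost = estimated_cost('0.15', 100, 2, '0.20')
print(cost)
print(can_afford('10000.00', cost))
```

Decimal is used instead of float so that amounts such as 0.15 do not pick up binary rounding errors.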

Part 2: Creating Tasks

Now that you’re able to connect to MTurk, you are ready to start posting tasks that Workers can do. To get started, let’s review some quick concepts:

Worker: a Worker refers to anyone with an MTurk Worker account. Workers browse tasks posted on MTurk and can choose to accept a task, work on it and then submit it when it is done.

HIT: a HIT stands for “Human Intelligence Task”. A HIT is a single unit of work that you want to complete. For example, if you want to label a collection of 100 images, each of those images could be a single HIT.

Assignment: you can ask one or more Workers to complete each of your HITs. The work submitted by each Worker for each HIT is called an Assignment. So for example, if two Workers labelled each of your 100 images, you would get 2 Assignments per HIT, for a total of 200 assignments. Why would you ask more than one Worker to complete the same task twice? Because you can then compare the results from multiple people and improve the confidence and quality of your training data set.
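To make the idea of comparing answers from several Workers concrete, here is a minimal sketch of a majority vote over the labels submitted for one HIT. It is one simple aggregation strategy among many, and the function name majority_label is illustrative.

```python
from collections import Counter


def majority_label(labels):
    """Return (label, agreement) for the most common answer.

    labels must be a non-empty list of strings. Answers are normalized by
    trimming whitespace and lowercasing; agreement is the fraction of
    Workers who gave the winning label. On a tie, one of the tied labels
    is returned.
    """
    normalized = [label.strip().lower() for label in labels]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / float(len(normalized))


print(majority_label(['happy', 'Happy ', 'excited']))
```

A low agreement value is a useful signal that an item is ambiguous and may deserve an extra Assignment or a manual look.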

Defining a HIT
Let’s start putting together a new HIT. Add the following to “create_tasks.py”:

question = open('questions.xml', 'r').read()

new_hit = mturk.create_hit(
    Title = 'Is this Tweet happy, angry, excited, scared, annoyed or upset?',
    Description = 'Read this tweet and type out one word to describe the emotion of the person posting it: happy, angry, excited, scared, annoyed or upset',
    Keywords = 'text, quick, labeling',
    Reward = '0.15',
    MaxAssignments = 1,
    LifetimeInSeconds = 172800,
    AssignmentDurationInSeconds = 600,
    AutoApprovalDelayInSeconds = 14400,
    Question = question,
)

print("A new HIT has been created. You can preview it here:")
print("https://workersandbox.mturk.com/mturk/preview?groupId=" + new_hit['HIT']['HITGroupId'])
print("HITID = " + new_hit['HIT']['HITId'] + " (Use to Get Results)")

# Remember to modify the URL above when you're publishing
# HITs to the live marketplace.
# Use: https://worker.mturk.com/mturk/preview?groupId=

Let’s go through what these fields mean:

Title, Description and Keywords: these will help Workers understand what your task is about when browsing HITs. The Keywords help with improving the discoverability of your HIT in MTurk’s search results.

Reward: what you will pay a Worker if you approve their submitted work (it does not include fees paid to MTurk).

MaxAssignments: how many Workers you want to work on this single HIT.

LifetimeInSeconds and AssignmentDurationInSeconds: these let you specify how long you want the HIT to be available on the marketplace and how much time a Worker will have to complete the HIT once they start on it. Keep both these time limits high, unless you have a specific reason to shorten them.

AutoApprovalDelayInSeconds: you can also specify after how long a Worker’s assignment will get automatically approved if you do not explicitly approve or reject it. Keep this limit as short as possible. By default assignments will be automatically approved after 2 days.

Question: this contains a string of HTML or XML content you specify to define what the layout looks like.

In this example, this field is being populated by reading a file called “questions.xml” which we have not yet created. This file will define what our task will actually look like for Workers. Let’s walk through that next.
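The three time fields are plain numbers of seconds, which are easy to mistype. As a small sketch, the values used in the example above can be derived from readable durations with the standard library. The helper name seconds and the hit_timing dict are illustrative.

```python
from datetime import timedelta


def seconds(**duration):
    """Convert a readable duration, e.g. seconds(days=2), to whole seconds."""
    return int(timedelta(**duration).total_seconds())


hit_timing = {
    'LifetimeInSeconds': seconds(days=2),                 # 172800
    'AssignmentDurationInSeconds': seconds(minutes=10),   # 600
    'AutoApprovalDelayInSeconds': seconds(hours=4),       # 14400
}
print(hit_timing)
```

The dict can then be unpacked into the call alongside the other fields, as in mturk.create_hit(Title=..., **hit_timing).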

Defining the task layout
An MTurk HIT is an HTML document containing a simple form. You can customize everything about the document using HTML, CSS and JavaScript. You can add any number of images, text fields, radio buttons, check boxes and so on. You can also link to external resources like Bootstrap, jQuery or React. Your HTML will be rendered and loaded inside an iframe that has a height of 600px by default unless you specify something different.

Here, we will keep it simple, and design a layout with one text field and some instructions. Create a new file called “questions.xml” and add in the following:

<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
<HTMLContent><![CDATA[<!-- YOUR HTML BEGINS -->
<!DOCTYPE html>
<html>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
<script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
</head>
<body>
<form name='mturk_form' method='post' id='mturk_form' action='https://www.mturk.com/mturk/externalSubmit'>
<input type='hidden' value='' name='assignmentId' id='assignmentId'/>
<h2>Is this Tweet happy, angry, excited, scared, annoyed or upset? Type in one word to describe the main emotion in the message. If it is unclear, type in "unclear".</h2>
<h3>Tweet: "I am really looking forward to the next Seahawks game!"</h3>
<div>
  <input type='text' name='reported_emotion' placeholder='Type in your answer here'>
</div>
<p><input type='submit' id='submitButton' value='Submit' /></p>
</form>
<script language='Javascript'>turkSetAssignmentID();</script>
</body>
</html>
<!-- YOUR HTML ENDS -->]]></HTMLContent><FrameHeight>600</FrameHeight></HTMLQuestion>

Why is there XML in here? What’s happening is that your HTML code is being wrapped inside an XML object before being sent to MTurk. It’s possible to use this same wrapper to define your layout using just XML and not HTML. In that case, the MTurk API will take care of the look and feel of your HIT for you, and your ability to customize the experience is limited.

For our purposes, you can ignore the XML and just focus on your HTML code inside the wrapper.

Note that it is important to give each of your input fields a “name” attribute. You will use that to keep track of the responses later.

Once you save questions.xml, try running your Python code again and see your HIT published in the MTurk Sandbox.

As you can probably see, this is not a very well-designed task. For example, it would be a better idea to present Workers with a list of emotions as radio buttons or checkboxes, instead of asking them to type it in each time.

We will go with this for now, but when you are creating your HITs it is worth investing time into thinking through the design carefully. Our Best Practices Guide contains more useful tips you can use for this.

Lastly, this questions file works for a single HIT. What would you do if you wanted to publish 50 HITs about 50 different images you needed to tag? You could read in this file as a string, loop through your list of 50 image URLs and use string substitution to inject each URL into the string before creating a HIT. You can probably think of a few other ways of doing this too.
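The string-substitution approach could look like the sketch below. It assumes you have edited questions.xml so that the tweet text is replaced by a marker; the marker {{TWEET}} and the function name build_questions are made up for this example. It is written for Python 3, where html.escape is available (on Python 2, cgi.escape plays the same role).

```python
import html

PLACEHOLDER = '{{TWEET}}'  # hypothetical marker placed in questions.xml


def build_questions(template, tweets):
    """Return one Question string per tweet by filling in the placeholder."""
    if PLACEHOLDER not in template:
        raise ValueError('template does not contain ' + PLACEHOLDER)
    # Escape each tweet so characters like < or & cannot break the HTML.
    return [template.replace(PLACEHOLDER, html.escape(tweet)) for tweet in tweets]


# Stand-in for the contents of questions.xml
template = '<h3>Tweet: "{{TWEET}}"</h3>'
for question in build_questions(template, ['Go Seahawks!', 'Traffic again <sigh>']):
    print(question)
```

Each resulting string can be passed as the Question argument of its own mturk.create_hit() call while the other arguments stay the same. The same pattern works for a list of image URLs.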

When you run this code, you will get back a “HITId” when you create the HIT. This is a unique ID that you can later use to get results (“Assignments”) being submitted by Workers for your task. Let’s see how that works.

Part 3: Retrieving Results

Create a new Python file called “get_results.py” and save it to the same working directory. Set it up like your “create_tasks.py” file from before, and use the HITId that you got back from creating your HIT.

import boto3

MTURK_SANDBOX = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com'

mturk = boto3.client('mturk',
   aws_access_key_id = "PASTE_YOUR_IAM_USER_ACCESS_KEY",
   aws_secret_access_key = "PASTE_YOUR_IAM_USER_SECRET_KEY",
   region_name='us-east-1',
   endpoint_url = MTURK_SANDBOX
)

# You will need the following library
# to help parse the XML answers supplied from MTurk
# Install it in your local environment with
# pip install xmltodict
import xmltodict

# Use the hit_id previously created
hit_id = 'PASTE_IN_YOUR_HIT_ID'

# We are only publishing this task to one Worker
# So we will get back an array with one item if it has been completed
worker_results = mturk.list_assignments_for_hit(HITId=hit_id, AssignmentStatuses=['Submitted'])

This will return a Python dictionary and if Workers have submitted any Assignments, they will show up in an array with the key “Assignments”. Each Assignment is itself a dict and has the following structure:

{
    'AssignmentId': 'string',
    'WorkerId': 'string',
    'HITId': 'string',
    'AssignmentStatus': 'Submitted'|'Approved'|'Rejected',
    'AutoApprovalTime': datetime(2015, 1, 1),
    'AcceptTime': datetime(2015, 1, 1),
    'SubmitTime': datetime(2015, 1, 1),
    'ApprovalTime': datetime(2015, 1, 1),
    'RejectionTime': datetime(2015, 1, 1),
    'Deadline': datetime(2015, 1, 1),
    'Answer': 'string',
    'RequesterFeedback': 'string'
}

The actual input from the Worker is stored in the “Answer” field, and it is an XML string. The responses entered by the Worker need to be extracted from the string. There are many ways to do this and below we show you one option using the “xmltodict” module that you installed earlier. Update your “get_results.py” file and add:

if worker_results['NumResults'] > 0:
   for assignment in worker_results['Assignments']:
      xml_doc = xmltodict.parse(assignment['Answer'])

      print("Worker's answer was:")
      answers = xml_doc['QuestionFormAnswers']['Answer']
      if isinstance(answers, list):
         # Multiple fields in HIT layout
         for answer_field in answers:
            print("For input field: " + answer_field['QuestionIdentifier'])
            # xmltodict returns None for an empty element (blank answer)
            print("Submitted answer: " + (answer_field['FreeText'] or ''))
      else:
         # One field found in HIT layout
         print("For input field: " + answers['QuestionIdentifier'])
         print("Submitted answer: " + (answers['FreeText'] or ''))
else:
   print("No results ready yet")

This code parses the string in the “Answer” field and converts it to a Python dict. Inside the dict, each input field that you had in your questions.xml file is available as a dictionary. Each input field’s dictionary has a “QuestionIdentifier” key that contains the name of the input field that you set in your HTML layout. The Worker’s input is stored in the “FreeText” key. If you have multiple fields in your HIT, you will get back an array of results, which slightly changes how to parse and retrieve the answers; this is reflected in the code sample.
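As another of the “many ways” to extract the answers, the sketch below uses only the standard library and returns a flat dict keyed by input field name, so one field and many fields are handled the same way. The function name parse_answers is illustrative. The sample payload is hand-written to match the structure described above, with a placeholder namespace rather than MTurk’s real schema URL.

```python
import xml.etree.ElementTree as ET


def parse_answers(answer_xml):
    """Return {input field name: submitted text} from an Answer XML string."""
    def local(tag):
        # Drop the '{namespace}' prefix ElementTree adds to tag names.
        return tag.rsplit('}', 1)[-1]

    answers = {}
    for element in ET.fromstring(answer_xml).iter():
        if local(element.tag) != 'Answer':
            continue
        fields = {local(child.tag): (child.text or '') for child in element}
        answers[fields['QuestionIdentifier']] = fields.get('FreeText', '')
    return answers


# Illustrative payload; the xmlns value is a placeholder.
sample = (
    '<QuestionFormAnswers xmlns="http://example.com/placeholder-schema">'
    '<Answer><QuestionIdentifier>reported_emotion</QuestionIdentifier>'
    '<FreeText>excited</FreeText></Answer>'
    '</QuestionFormAnswers>'
)
print(parse_answers(sample))
```

With real data you would call parse_answers(assignment['Answer']) for each item in the “Assignments” list.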

Now that you have the results submitted by a Worker, you can do one of three things: approve the Assignment, reject the Assignment or do nothing. If you approve, the Worker will be paid for the work submitted and the Assignment will be marked as approved.

You also have the option to pay a Worker a “bonus” that is an additional amount separate from the reward amount associated with the task. This allows you to offer variable, performance based rewards to Workers. To send a bonus to a Worker, use the “mturk.send_bonus()” operation.
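A bonus call might be wrapped as in the sketch below. The wrapper name pay_bonus is illustrative. It expects the Boto3 MTurk client created earlier and one item from the “Assignments” list, and it passes the bonus amount as a string in US dollars.

```python
def pay_bonus(client, assignment, amount, reason):
    """Send a bonus to the Worker who submitted the given Assignment.

    client is a boto3 MTurk client; assignment is one item from the
    'Assignments' list. amount is a string in US dollars, e.g. '0.05'.
    """
    return client.send_bonus(
        WorkerId=assignment['WorkerId'],
        AssignmentId=assignment['AssignmentId'],
        BonusAmount=amount,
        Reason=reason,  # shown to the Worker
    )


# Usage, assuming `mturk` and `assignment` from the code above:
#   pay_bonus(mturk, assignment, '0.05', 'Thanks for the careful answer!')
```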

If you reject the Assignment, it will be marked as rejected and the Worker will not be paid. When rejecting work, you must include a reason why you are rejecting the task. Generally speaking, tasks submitted by Workers should be rejected only rarely, in cases where a Worker is clearly submitting malicious results (such as leaving everything blank, or typing in the same text over and over again).

If you don’t do anything, the Assignment will be automatically approved after a set time.
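The approve and reject decisions could be sketched as below. The policy shown (reject only blank answers, approve everything else) is purely illustrative and deliberately lenient, and the function name review_assignment is made up. The client methods approve_assignment() and reject_assignment() are the Boto3 MTurk operations for these two actions.

```python
def review_assignment(client, assignment, answer_text):
    """Approve an Assignment unless the answer is blank.

    Illustrative policy only: blank submissions are rejected with a reason,
    everything else is approved. Returns 'approved' or 'rejected'.
    """
    assignment_id = assignment['AssignmentId']
    if answer_text and answer_text.strip():
        client.approve_assignment(AssignmentId=assignment_id)
        return 'approved'
    client.reject_assignment(
        AssignmentId=assignment_id,
        RequesterFeedback='The answer field was left blank.',  # reason is required
    )
    return 'rejected'


# Usage, assuming `mturk` and `assignment` from the code above and
# `answer` holding the Worker's FreeText value:
#   review_assignment(mturk, assignment, answer)
```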

Part 4: Do More With MTurk

Congratulations on making it this far! You should now have a good sense of what it’s like to work with MTurk using Python. There is a lot more you can do next with MTurk, and here are a few good jumping-off points:

See the full list of operations you can use in Python
Learn about Qualifications, a mechanism you can use to target work to Workers meeting specific criteria (this post uses Ruby examples, but can be easily converted to Python).
Use MTurk in the cloud, using AWS Lambda: great for deploying applications in production and not on the dusty machine under your desk
Learn how to let MTurk Workers use bounding boxes to annotate images

If you have any questions, please post a question to our MTurk forums. To become a Requester, sign up here. Want to contribute as a Worker customer? Get started here.

Written by Amazon Mechanical Turk