AWS SageMaker In Action

Machine Learning is not new, but it has been advancing quickly in recent years, thanks in part to public cloud providers such as AWS. With SageMaker, AWS is making Machine Learning more accessible and more affordable for developers and data scientists. Combined with other AWS services such as Glue, which facilitates data engineering, AWS becomes a great place for practicing and deploying machine learning applications.

In this post we will not go over the details of these services, as the AWS documentation is informative and well organized. Instead, we will get our hands dirty and use them to see how useful they are for someone with little Machine Learning experience. Most of the examples on the AWS website are about image processing, and I wanted to try as many SageMaker tools and services as possible, from scratch. So I defined a text classification use case. We are a cloud company and it’s important for us to know what’s going on in the public cloud! I want to build an analyzer that can say how many of the tweets posted on Twitter about AWS are technical and how many are marketing or commercial. The same goes for Azure, so that no one says I’m biased! Anyway, let’s start!

The following is a general overview of the procedure.

Overview

Step 0: Collect Data

We will use supervised learning techniques. Supervised learning requires several datasets: training, validation, and test. We will talk about how to prepare the data later, but the first step is to collect some relevant data. For this purpose we developed a simple Lambda function that collects tweets with a specific hashtag and stores them in AWS S3. You can find the code on GitHub:

import boto3
import tweepy
import csv
import os
import datetime

def lambda_handler(event, context):
    ####input your credentials here
    consumer_key = os.environ['CONSUMER_AUTH_KEY']
    consumer_secret = os.environ['CONSUMER_AUTH_SECRET']
    access_token = os.environ['ACCESS_TOKEN']
    access_token_secret = os.environ['ACCESS_SECRET']

    hashtag = os.environ['HASHTAG']
    # hashtag = event["hashtag"]
    datalake_bucket = os.environ['DATALAKE_BUCKET']

    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    api = tweepy.API(auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True)
    # Open/Create a file to append data
    csvFile = open('/tmp/hashtag.csv', 'a')
    #Use csv Writer
    csvWriter = csv.writer(csvFile)

    hashtag_entry = f"#{hashtag}"
    s3 = boto3.resource('s3')

    # Start of the 24-hour collection window
    yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
    y_day = yesterday.strftime("%Y-%m-%d")
    print("y_day: ", y_day, " yesterday: ", yesterday)

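    # NOTE: the original listing is cut off from this point. `api.search` (tweepy 3.x)
    # and the row layout are assumptions inferred from the sample output below; any
    # further search arguments, such as a date filter based on y_day, are not recoverable.
    for tweet in tweepy.Cursor(api.search, q=hashtag_entry, count=100).items():
        csvWriter.writerow([hashtag, tweet.user.name, tweet.created_at,
                            tweet.text.encode('utf-8')])
    csvFile.close()

    now = datetime.datetime.now()
    now_dir = now.strftime("%Y-%m-%d")
    # The upload of /tmp/hashtag.csv to datalake_bucket is also missing from the original listing.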

We will use this Lambda function in several ways, but here it is used for collecting data. Each run stores all the tweets with the specified hashtag from the past 24 hours. An example of the output is:

aws,vmiss,2019-01-04 17:52:54,b'The Truth About Virtual Machines In The Cloud #azure #aws #gcp #cloud #ITPro'
aws,Lily's Social Media and Online Training,2019-01-04 17:50:45,b'JANUARY is for #AWS #CERTIFICATION\nFEATURED COURSES\nAWS Certified Solutions Architect - Professional 2019\n\nACE the\xe2\x80\xa6'
aws,Jose Hidalgo Garcia,2019-01-04 17:50:39,b'RT @adhorn: My new post is out! Injecting Chaos to #AWS Lambda functions using Lambda Layers. Hope you enjoy :-) #se\xe2\x80\xa6'
aws,Angel Alejos,2019-01-04 17:50:21,b'RT @adhorn: My new post is out! Injecting Chaos to #AWS Lambda functions using Lambda Layers. Hope you enjoy :-) #se\xe2\x80\xa6'
aws,Gwen L Holland,2019-01-04 17:49:34,"b'Ace Info Solutions, Inc. is hiring a Cloud Automation Engineer in Bowie, MD #job #AWS #GovCloud'"
aws,josacar,2019-01-04 17:46:33,b'RT @adhorn: My new post is out! Injecting Chaos to AWS Lambda functions using Lambda Layers. Hope you enjoy :-) #se\xe2\x80\xa6'
aws,HYBRID TECHNOLOGY SYSTEMS,2019-01-04 17:46:02,b'How to install the Passbolt Team Password Manager on Ubuntu 18.04 via @hybrid_ts #AWS #Cloud'
aws,451 Research,2019-01-04 17:45:02,"b'#AWS may provide the tools, but expert #partner assistance will be needed to fully take advantage of those tools. F\xe2\x80\xa6'"
aws,Intelligent Edge,2019-01-04 17:43:48,b'RT @IoTGN: Why deploying operational tech data to an OT/IT #cloud matters @MoxaInc\n #IIoT #EdgeComputing #IT #AWS #\xe2\x80\xa6'
aws,Bob Harris,2019-01-04 17:43:41,b'RT @richmerrett815: Very proud to have been involved with setting this new #AWS Alexa Skill Builder Certification. A great few days with so\xe2\x80\xa6'
aws,Sergio Cuéllar ☁️,2019-01-04 17:42:44,"b'RT @jeffbarr: Really cool - #AWS CLI Builder - - ""Build your own AWS CLI commands...""'"
aws,AirwaySim,2019-01-04 17:41:47,b'Electrical blackout at Alicante - El Altet - flights cancelled #AWS #gameStatus #BW1'
aws,Whizlabs,2019-01-04 17:41:01,b'Best #Books for #AWS Certified Solutions #Architect #Exam via @whizlabs'
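Note that the tweet text is stored as the printed form of a Python bytes object (`b'...'`), with non-ASCII characters as `\x..` escapes. Before labeling or training it helps to turn that back into plain text. The following is a minimal sketch, assuming the four-column layout shown above; the names `TweetRow`, `decode_text` and `read_tweets` are hypothetical helpers, not part of the Lambda function.

```python
import ast
import csv
import io
from typing import Iterator, NamedTuple


class TweetRow(NamedTuple):
    hashtag: str
    user: str
    created_at: str
    text: str


def decode_text(field: str) -> str:
    """Turn a stored "b'...'" literal back into readable text."""
    if not field.startswith(("b'", 'b"')):
        return field
    try:
        raw = ast.literal_eval(field)  # evaluates literals only, never code
    except (ValueError, SyntaxError):
        return field  # leave anything unparseable untouched
    return raw.decode("utf-8", errors="replace")


def read_tweets(csv_text: str) -> Iterator[TweetRow]:
    """Yield one TweetRow per well-formed line of the collected CSV."""
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 4:
            continue  # skip malformed lines
        hashtag, user, created_at, text = row
        yield TweetRow(hashtag, user, created_at, decode_text(text))


# Two shortened lines in the same shape as the sample output above.
sample = r'''aws,vmiss,2019-01-04 17:52:54,b'The Truth About Virtual Machines In The Cloud #azure #aws'
aws,451 Research,2019-01-04 17:45:02,"b'#AWS may provide the tools, but F\xe2\x80\xa6'"
'''

for tweet in read_tweets(sample):
    print(tweet.user, "->", tweet.text)
```

`ast.literal_eval` is used rather than `eval` because the file content comes from the public internet and should never be executed.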

Step 1: Prepare Training Data

As mentioned, we use supervised learning techniques, so we need some data to train the model (the training dataset). From the AWS documentation:

The type of data that you need depends on the business problem that you want the model to solve (the inferences that you want the model to generate). For example, suppose that you want to create a model to predict a number given an input image of a handwritten digit. To train such a model, you need example images of handwritten numbers. Data scientists often spend a lot of time exploring and preprocessing, or “wrangling,” example data before using it for model training. To preprocess data, you typically do the following: Fetch the data— You might have in-house example data repositories, or you might use datasets that are publicly available. Typically, you pull the dataset or datasets into a single repository.

Although various open datasets exist, they are not usable in our scenario, so we need to prepare the training data ourselves. Basically, we need to choose some of the data we collected in Step 0 and label it. The amount of data should be large enough for a good model to be created. This could be elaborated, but again, here we are mostly focused on how to use SageMaker rather than on data science and data engineering. To label this data we could use our own people, but it’s usually not easy to find people who can spend the time; in this case they would have to go over thousands of tweets. AWS can help us with this. Recently AWS launched Ground Truth:

Ground Truth

We used Ground Truth by creating a new labeling job. It’s fairly straightforward. When creating a labeling job, you specify the location of the input file in an S3 bucket, an S3 bucket as the output location and, of course, an IAM role with proper access to the S3 buckets involved. You should also specify the type of content to be labeled.

Labeling Job
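For text, the input file of a labeling job can be a manifest with one JSON object per line, using the same `source` key that appears in the labeling output shown later in this post. As a rough sketch of how the collected lines could be turned into such a manifest (the helper name `build_input_manifest` is hypothetical, and uploading the result to S3 is left out):

```python
import json
from typing import Iterable, List


def build_input_manifest(texts: Iterable[str]) -> str:
    """Return JSON Lines text: one {"source": ...} object per non-empty item."""
    lines: List[str] = []
    for text in texts:
        text = text.strip()
        if text:  # empty lines would only waste labeling budget
            lines.append(json.dumps({"source": text}, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


collected = [
    "InfoSec Industry,2019-01-04 17:32:32,b'Kubernetes Security Issues #AWS #infosec'",
    "",
    "Whizlabs,2019-01-04 17:41:01,b'Best #Books for #AWS Certified Solutions #Architect #Exam'",
]
print(build_input_manifest(collected), end="")
```

`json.dumps` takes care of escaping quotes and backslashes, so a tweet can never break the manifest structure.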

In the next step it will ask you about the workforce you want to engage. The workforce can be public, your own employees, or external verified people.

Labeling Job

In this case we don’t have any private information, so the public workforce is OK. AWS already offered this kind of service as Amazon Mechanical Turk, and it is now integrated into the AWS SageMaker page. Also, as you can see in the screenshot above, you should specify the labels and give some instructions and examples to the Mechanical Turk workers. The number of workers is adjustable as well. A labeling job can be costly, so please check the Ground Truth pricing page before trying it!

Anyway, the output of the labeling job is an augmented manifest file. It’s a JSON Lines file: each line is a JSON object that holds one entry, its label, and some details about how it was labeled. Some example entries from this manifest file:

{"source":"Chidambara .ML.,2019-01-04 17:34:02,b\u0027RT @ThingsExpo: CloudEXPO Silicon Valley Show Prospectus Published \\n\\n\\n\\n@Geek_King @TotalUptime #Cloud #IoT #IIoT #CI\\xe2\\x80\\xa6\u0027","hashtag-positivity-label-clone":1,"hashtag-positivity-label-clone-metadata":{"confidence":0.62,"job-name":"labeling-job/hashtag-positivity-label-clone","class-name":"Marketing","human-annotated":"yes","creation-date":"2019-01-08T18:53:17.309395","type":"groundtruth/text-classification"}}
{"source":"Ashot Nalbandyan,2019-01-04 17:33:00,b\u0027RT @dr_vitus_zato: How to Growth Stack Your Product \u003d\u0026gt; #javascript #vuejs #code #php #angular #reactjs #redux #css\\xe2\\x80\\xa6\u0027","hashtag-positivity-label-clone":1,"hashtag-positivity-label-clone-metadata":{"confidence":0.92,"job-name":"labeling-job/hashtag-positivity-label-clone","class-name":"Marketing","human-annotated":"yes","creation-date":"2019-01-08T20:23:38.667523","type":"groundtruth/text-classification"}}
{"source":"InfoSec Industry,2019-01-04 17:32:32,b\u0027Kubernetes Security Issues (CVE-2018-18264 and kubectl proxy) #AWS #infosec\u0027","hashtag-positivity-label-clone":0,"hashtag-positivity-label-clone-metadata":{"confidence":0.66,"job-name":"labeling-job/hashtag-positivity-label-clone","class-name":"Technical","human-annotated":"yes","creation-date":"2019-01-08T17:08:04.949567","type":"groundtruth/text-classification"}}
{"source":"Jamed,2019-01-04 17:30:42,b\u0027RT @NearShore_Tech: #Python opportunity in #Puebla and #Merida - #Django #AWS #Docker #Flask  Send your resume \\xe2\\x86\\x92 careers@nearshoretechnolog\\xe2\\x80\\xa6\u0027","hashtag-positivity-label-clone":1,"hashtag-positivity-label-clone-metadata":{"confidence":0.83,"job-name":"labeling-job/hashtag-positivity-label-clone","class-name":"Marketing","human-annotated":"yes","creation-date":"2019-01-08T19:37:00.318122","type":"groundtruth/text-classification"}}
{"source":"Lily\u0027s Social Media and Online Training,2019-01-04 17:30:36,b\u0027JANUARY is for #AWS #CERTIFICATION\\nFEATURED COURSES\\n\\nAWS Certified Developer Associate 2019\\nPass the AWS Certified\\xe2\\x80\\xa6\u0027","hashtag-positivity-label-clone":0,"hashtag-positivity-label-clone-metadata":{"confidence":0.74,"job-name":"labeling-job/hashtag-positivity-label-clone","class-name":"Technical","human-annotated":"yes","creation-date":"2019-01-08T19:05:36.475779","type":"groundtruth/text-classification"}}
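Each entry carries the numeric label, the class name and a confidence score, so the manifest is easy to read with plain Python. Here is a small sketch that pulls those fields out and optionally drops low-confidence labels. The attribute name `hashtag-positivity-label-clone` comes from the entries above; `LabeledTweet`, `read_labels` and the 0.7 threshold are illustrative assumptions only.

```python
import json
from typing import Iterator, NamedTuple


class LabeledTweet(NamedTuple):
    text: str
    label: int
    class_name: str
    confidence: float


def read_labels(manifest_text: str, label_attr: str,
                min_confidence: float = 0.0) -> Iterator[LabeledTweet]:
    """Yield labeled entries from an augmented manifest (JSON Lines)."""
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if label_attr not in entry:
            continue  # entry was never labeled
        meta = entry.get(f"{label_attr}-metadata", {})
        confidence = float(meta.get("confidence", 0.0))
        if confidence < min_confidence:
            continue
        yield LabeledTweet(entry["source"], int(entry[label_attr]),
                           meta.get("class-name", ""), confidence)


# Shortened entries in the same shape as the manifest lines above.
sample = "\n".join([
    r'{"source":"tweet one","hashtag-positivity-label-clone":1,'
    r'"hashtag-positivity-label-clone-metadata":{"confidence":0.62,"class-name":"Marketing"}}',
    r'{"source":"tweet two","hashtag-positivity-label-clone":0,'
    r'"hashtag-positivity-label-clone-metadata":{"confidence":0.92,"class-name":"Technical"}}',
])

for item in read_labels(sample, "hashtag-positivity-label-clone", min_confidence=0.7):
    print(item.class_name, item.confidence, item.text)
```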

Now we have some labeled data that can be used by training jobs, and we are ready to proceed with the next step.
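Since supervised learning needs training, validation and test datasets, the labeled entries have to be divided up at some point. A simple way is a shuffled split, sketched below. The 80/10/10 ratio and the function name `split_dataset` are my own assumptions, not something SageMaker prescribes.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(items: Sequence[T], train: float = 0.8, validation: float = 0.1,
                  seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split into (train, validation, test); the test set gets the rest."""
    if train <= 0 or validation < 0 or train + validation > 1:
        raise ValueError("fractions must be positive and sum to at most 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_validation],
            shuffled[n_train + n_validation:])


train_set, validation_set, test_set = split_dataset(list(range(100)))
print(len(train_set), len(validation_set), len(test_set))
```

This split is not stratified, so with a small or unbalanced set of labels it is worth checking that both classes show up in every part.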

Step 2: Training the Model

In theory we are ready to train our model. Let’s have a look at the following diagram, which shows how training works:

ML workflow

For a practical example using ready, labeled data, see the AWS guide.

If you follow the AWS guides, you will see that the key to all Machine Learning operations is a Notebook instance. Those who have worked with Jupyter Notebooks already know how it works. Using the GUI provided by Notebooks you can easily test your code, visualize the results, and so on. The tools and libraries developers need are installed on the underlying machine. We can use Notebooks to train the model, but AWS SageMaker recently added an option to train a model without launching a Notebook or doing any development. We wanted to give it a try:

Training Job

As you can see, you get an interface in which you can specify the algorithm to use and the hyperparameters related to that specific algorithm. With the training jobs GUI you don’t have much visibility into what happens in the background; the idea is that the model is created for you. My personal experience was that this option didn’t work as expected, and because of the lack of visibility I couldn’t troubleshoot it. Out of curiosity I tried this option to create a model, but I had no success after more than 20 tries! I contacted AWS Support; they confirmed that they could reproduce the errors I was getting and started an investigation, but after one month they still hadn’t figured it out:

As an update, I am working with SageMaker Experts to figure out a workaround for the same. I will surely update you on the case the moment I will have something substantial to put forward to you.

So I gave up on this way of running training jobs. The better way is to use Jupyter Notebooks, which give more control over the operations. Using Notebooks in SageMaker we can fetch the training data and the proper training code (as containers), launch AWS ML instances and, last but not least, use the training data and training code on those instances to train the model. SageMaker saves the resulting model artifacts and other output in the S3 bucket we specified for that purpose.
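To make those moving parts concrete, here is a sketch of the request that describes such a training job: the training container, the input channels in S3, the ML instance and the output bucket. It only builds a dictionary in the shape that boto3's SageMaker `create_training_job` call accepts; nothing is sent to AWS. The function name `build_training_request`, the image name, the role ARN, the bucket paths and the default instance settings are all placeholders I made up for illustration.

```python
from typing import Any, Dict


def build_training_request(job_name: str, training_image: str, role_arn: str,
                           train_s3_uri: str, validation_s3_uri: str,
                           output_s3_uri: str, hyperparameters: Dict[str, Any],
                           instance_type: str = "ml.m5.xlarge",
                           content_type: str = "text/plain") -> Dict[str, Any]:
    """Assemble the arguments for a SageMaker training job."""

    def channel(name: str, s3_uri: str) -> Dict[str, Any]:
        return {
            "ChannelName": name,
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
            "ContentType": content_type,
        }

    return {
        "TrainingJobName": job_name,
        # The training code is shipped as a container image.
        "AlgorithmSpecification": {"TrainingImage": training_image,
                                   "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        "InputDataConfig": [channel("train", train_s3_uri),
                            channel("validation", validation_s3_uri)],
        # Model artifacts are written here when the job finishes.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {"InstanceType": instance_type,
                           "InstanceCount": 1, "VolumeSizeInGB": 30},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # The API expects every hyperparameter value as a string.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


request = build_training_request(
    job_name="tweet-classifier-demo",                        # placeholder
    training_image="<algorithm-container-image-uri>",        # placeholder
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",   # placeholder
    train_s3_uri="s3://example-bucket/train/",
    validation_s3_uri="s3://example-bucket/validation/",
    output_s3_uri="s3://example-bucket/output/",
    hyperparameters={"epochs": 10},
)
print(request["HyperParameters"])
# With real values: boto3.client("sagemaker").create_training_job(**request)
```

Keeping the request as plain data makes it easy to inspect before anything is launched, which is exactly the visibility the GUI option was missing.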

For this exercise we will use an algorithm provided by AWS SageMaker to train the model. The algorithm we chose is BlazingText, which, based on some investigation, can be used for text classification and so fits our use case. The important thing is the input to the training code. Because we use a built-in algorithm, the format of the training data, which is the input, should follow the algorithm’s specification. As the guide for BlazingText reads:

For supervised mode, the training/validation file should contain a training sentence per line along with the labels. Labels are words that are prefixed by the string ‘__label__’. Here is an example of a training/validation file:

__label__4 linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .
__label__2 bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .
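In other words, every training line is a label word followed by a lowercased sentence with its tokens separated by spaces. A minimal sketch of producing one such line is shown below. The tokenization rule (keep hashtags and mentions whole, split off other punctuation) is my own assumption, and `to_blazingtext_line` is a hypothetical helper rather than anything provided by SageMaker.

```python
import re

# A word, optionally prefixed by # or @, or any single punctuation character.
_TOKEN = re.compile(r"[#@]?\w+|[^\w\s]")


def to_blazingtext_line(label: str, text: str) -> str:
    """Format one example as '__label__<label> token token ...'."""
    if not label or any(ch.isspace() for ch in label):
        raise ValueError("label must be a single word without whitespace")
    tokens = _TOKEN.findall(text.lower())
    return f"__label__{label} " + " ".join(tokens)


print(to_blazingtext_line(
    "Technical",
    "Kubernetes Security Issues (CVE-2018-18264) #AWS",
))
```

Because the tokens are re-joined with single spaces, line breaks inside a tweet disappear, which keeps each example on one line as the format requires.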

Step 2-1: Transforming Training Data

If you look at our training data, you see that the format is different. Even if we wanted to use the manifest file as input, some data engineering would still be required to transform the data. To transform and clean the data we used another great AWS service: AWS Glue. We defined a Glue job to convert the JSON file to a simple CSV that is close to what our chosen algorithm expects. It’s pretty easy. As you can see in the following picture, we can easily map a field to a column in the output or even remove unnecessary fields:

Glue Job

The result is shown below; with a little tweaking it will be suitable for BlazingText:

1,"Chidambara .ML.,2019-01-04 17:34:02,b'RT @ThingsExpo: CloudEXPO Silicon Valley Show Prospectus Published \n\n\n\n@Geek_King @TotalUptime #Cloud #IoT #IIoT #CI\xe2\x80\xa6'"
1,"Ashot Nalbandyan,2019-01-04 17:33:00,b'RT @dr_vitus_zato: How to Growth Stack Your Product => #javascript #vuejs #code #php #angular #reactjs #redux #css\xe2\x80\xa6'"
0,"InfoSec Industry,2019-01-04 17:32:32,b'Kubernetes Security Issues (CVE-2018-18264 and kubectl proxy) #AWS #infosec'"
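Part of that tweak is that the second column still contains the user name and the timestamp in front of the tweet itself, and those should not end up in the training sentence. One possible way to separate them, sketched here under the assumption that every row has the `name,YYYY-MM-DD HH:MM:SS,text` layout seen above (`split_source` and `read_glue_rows` are hypothetical names):

```python
import csv
import io
import re
from typing import Iterator, Tuple

# The timestamp is the only reliably shaped part, so it is used as the separator.
_TIMESTAMP = re.compile(r",(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),")


def split_source(source: str) -> Tuple[str, str, str]:
    """Split 'user,timestamp,text' into its three parts."""
    match = _TIMESTAMP.search(source)
    if match is None:
        raise ValueError("no timestamp found in source field")
    return source[:match.start()], match.group(1), source[match.end():]


def read_glue_rows(csv_text: str) -> Iterator[Tuple[int, str]]:
    """Yield (label, tweet text) pairs from the two-column Glue output."""
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip malformed lines
        label, source = row
        _user, _created_at, text = split_source(source)
        yield int(label), text


sample = r'''0,"InfoSec Industry,2019-01-04 17:32:32,b'Kubernetes Security Issues #AWS #infosec'"
'''

for label, text in read_glue_rows(sample):
    print(label, text)
```

Splitting on the timestamp instead of on plain commas keeps this working when the tweet text itself contains commas.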

We can now move on to the next phase, which is actually training a model.

To be continued …