Use AI to Write Captions for Images with Cloudsight + Python

Written By Thomas Smith

The New York Times called Thomas Smith a "veteran programmer." For over a decade, Smith has led Gado Images, an AI-driven visual content company.

There are lots of solutions out there on the market today for using Artificial Intelligence and Machine Learning to tag images. Solutions from IBM Watson, Imagga, Cloudsight, Google Vision and Microsoft Azure all perform well, and with services like Algorithmia, you can easily spin up and train your own image tagging network.

Writing human-readable, sentence-length captions for images, though, is a harder task. The sentence you generate not only needs to accurately describe the objects in the scene, but it needs to capture their relationships to each other, context, etc. And it needs to know what’s actually important — no one wants an image description that goes into excruciating detail about every object on a table or every plant in the background. When AI knows too much, it can be worse than knowing too little.

Cloudsight, a startup based out of Los Angeles, is working on addressing the challenge of using Artificial Intelligence and Machine Learning to automatically write sentence-length, human-readable captions for images.

They offer fully customized, human-in-the-loop systems for specific applications, but they also have a general model which you can start testing out very quickly, and which returns some impressive results. This is their AI-only system, but again, you can plug it into a human team as well and get even better output (at a higher price and with more training time).

Getting Started

To get started with using the Cloudsight API for testing, you can’t just sign up for an account, like you could with IBM or Google Vision. Cloudsight is an emerging company, and they like to have a hands-on collaboration with potential new clients, especially during early testing.

The good news is that they’re very responsive. Reach out via the contact form on their website and they’ll usually be back in a few hours. You might even hear from their CEO, or a senior member of the team. Cloudsight is usually fine with providing API access and a certain number of free API calls for testing. I got 1,000 free API calls to get started.

Once you’re in touch and set up, they’ll give you an API key.

Preparing Python

Conveniently, Cloudsight has a Python library all ready to go.

Install it with pip.

pip install cloudsight

If you don’t have it already, I also like to install PIL via Pillow, an image manipulation library.

pip install pillow

Preprocessing

First, choose an image to process. I’ll use the one at the top of this article, taken by a photographer in New Zealand who works with my company.

Set up the basic imports.

from PIL import Image
import cloudsight

I like to begin by downsizing the image to a standard size, via PIL’s thumbnail function. This makes it much smaller and easier to upload. If you don’t resize on your end, Cloudsight will do it for you, but sending fewer bytes makes uploads faster, so you might as well downsize on your side.

This is standard stuff — you’re taking in the image, using the thumbnail function to make it a consistent size, and then saving it out again.

im = Image.open('YOURIMAGE.jpg')
im.thumbnail((600,600))
im.save('cloudsight.jpg')
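One thing worth knowing about thumbnail: it resizes in place and preserves aspect ratio, fitting the image inside the bounding box you give it rather than stretching it to exactly 600×600. A quick sketch, using an in-memory test image instead of a file:

```python
from PIL import Image

# Create a 1200x800 test image in memory (no file needed)
im = Image.new('RGB', (1200, 800), 'gray')

# thumbnail() modifies the image in place and keeps the aspect ratio,
# so the result fits inside 600x600 without distortion
im.thumbnail((600, 600))
print(im.size)  # (600, 400)
```

So a landscape photo comes out 600 pixels wide, not 600 pixels tall.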

Making the Call

Now that you have a properly processed image, you can make the actual API call. First, authenticate using your API key and create an API object:

auth = cloudsight.SimpleAuth('YOUR KEY')
api = cloudsight.API(auth)

Next, open your image file, and make the request itself:

with open('cloudsight.jpg', 'rb') as f:
    response = api.image_request(f, 'cloudsight.jpg', {'image_request[locale]': 'en-US'})

The next bit is a little unexpected. Because Cloudsight sometimes has humans in the loop, you need to give them a few seconds or more to complete their bit of the request. This is a little different from API requests to a tagging service like Watson, which tend to complete instantly.

With Cloudsight, you can call the wait function, and define a maximum waiting time. 30 seconds is generally plenty.

status = api.wait(response['token'], timeout=30)
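Note that wait returns whatever status the request has reached when it finishes or the timeout expires, so it’s worth checking the status field before trusting the caption. A minimal guard might look like this (the sample dict here is abbreviated, but shaped like a real Cloudsight reply):

```python
def caption_from_status(status):
    """Return the caption text if Cloudsight finished the request, else None."""
    if status.get('status') == 'completed':
        return status.get('name')
    return None  # still processing, timed out, or skipped

# An abbreviated response dict, shaped like a real Cloudsight reply
sample = {'status': 'completed',
          'name': 'man in gray crew neck t-shirt sitting on brown wooden chair'}
print(caption_from_status(sample))
```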

Finally, print out the response!

print(status)

The Response

Cloudsight will give you back an object with the response data. Here’s what I got for the image above.

{'status': 'completed',
 'name': 'man in gray crew neck t-shirt sitting on brown wooden chair',
 'url': 'https://assets.cloudsight.ai/uploads/image_request/image/734/734487/734487841/cloudsight.jpg',
 'similar_objects': ['chair', 't-shirt', 'crew neck'],
 'token': 'fgo4F5ufJsQUCLrtwGJkxQ',
 'nsfw': False,
 'ttl': 54.0,
 'structured_output': {'color': ['gray', 'brown'], 'gender': ['man'], 'material': ['wooden']},
 'categories': ['fashion-and-jewelry', 'furniture']}

This is very cool stuff! As you can see, the system captioned the image “Man in gray crew neck t-shirt sitting on brown wooden chair.” That’s pretty spot on!

In addition to the unstructured sentence, the API also gave me back structured data about the image, like colors in the image (gray, brown), genders of the people (man), categories, and materials present.

The combination of structured and unstructured data in a single call is very helpful — you could do tagging or semantic mapping, while also getting a human readable sentence to describe the image.
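Pulling those pieces apart is plain dictionary access. For example (the dict here is trimmed to the relevant keys from a response like the one above):

```python
# A trimmed response dict, shaped like the Cloudsight reply above
status = {
    'name': 'man in gray crew neck t-shirt sitting on brown wooden chair',
    'structured_output': {'color': ['gray', 'brown'],
                          'gender': ['man'],
                          'material': ['wooden']},
    'categories': ['fashion-and-jewelry', 'furniture'],
}

caption = status['name']                       # human-readable sentence
colors = status['structured_output']['color']  # structured color tags
tags = colors + status['categories']           # combine into a flat tag list

print(caption)
print(tags)  # ['gray', 'brown', 'fashion-and-jewelry', 'furniture']
```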

Here’s the full program.

from PIL import Image
import cloudsight

# Downsize the image before uploading
im = Image.open('YOUR IMAGE')
im.thumbnail((600, 600))
im.save('cloudsight.jpg')

# Authenticate and send the request
auth = cloudsight.SimpleAuth('YOUR API KEY')
api = cloudsight.API(auth)
with open('cloudsight.jpg', 'rb') as f:
    response = api.image_request(f, 'cloudsight.jpg', {'image_request[locale]': 'en-US'})

# Wait for Cloudsight to finish, then print the result
status = api.wait(response['token'], timeout=30)
print(status)

It’s sentence-length image captioning, via Artificial Intelligence, in about 10 lines of code!
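If you plan to caption more than one image, it’s handy to wrap the request-and-wait steps in a function. This is just a sketch: the api parameter is injected rather than created inside, so any object with cloudsight-style image_request and wait methods will work, which also makes the function easy to test with a stub.

```python
def caption_image(api, path, timeout=30):
    """Send an image file to Cloudsight and return the final status dict.

    `api` is any object with cloudsight-style image_request/wait methods,
    e.g. cloudsight.API(auth) from the program above.
    """
    with open(path, 'rb') as f:
        response = api.image_request(f, path, {'image_request[locale]': 'en-US'})
    return api.wait(response['token'], timeout=timeout)
```

With the real library, you’d pass cloudsight.API(auth) as the api argument, exactly as in the full program.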

Where to Go From Here

If you want to scale up your use of Cloudsight after testing, the team can help get you ready for production. They don’t quote generic pricing — it all depends on your projected volume, use case, and any extra training required.

If the output from the general model is decent but not perfect, you can also work with the team to tailor it to your specific use case. I’m working with them on bringing humans partially into the loop to handle processing historical images, for example. The pricing will likely be higher than a pure AI solution, but it will also be much more specific to my use case. That’s the great part about working with an emerging company, since they can be much more responsive to your specific needs.

If you’re curious about automatic image captioning, check Cloudsight out! With Python, you can start captioning your images — or just play around with this new capability — in a few minutes, with very little code.

This was originally written in 2019. Cloudsight’s API may have evolved since then, but the fundamentals remain the same.
