OpenAI Vision

GPT-4 with Vision, also known as GPT-4V or gpt-4-vision-preview in its API form, lets the model take images as input and answer questions about them. Historically, language models were limited to a single input type, text, which constrained the range of applications for models such as GPT-4.

GPT-4 with vision is now accessible to all developers who have GPT-4 access through the gpt-4-vision-preview model and the updated Chat Completions API, which now accommodates image inputs. However, it's important to note that the Assistants API does not currently support image inputs.

Key points to remember include:

  • GPT-4 with vision is not a distinct model that diverges significantly from GPT-4; the main difference is the system prompt used with the model.
  • It is not a version of GPT-4 that performs worse on text-based tasks because of its vision capabilities; it is essentially GPT-4 with vision features added.
  • Vision is an enhancement that broadens the model's capabilities rather than a trade-off.

Getting Started

There are two primary ways to provide images to the model: by passing a link to the image or by including the base64 encoded image directly in the request. Images can be included in user, system, and assistant messages. For now, images are not supported in the initial system message, but this may change in the future. The example below passes an image by URL:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4-vision-preview",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0])

The model excels at responding to broad questions about the contents of images. It can comprehend the relationships between objects within images, but it's not fully optimized for answering intricate questions about the specific locations of objects in an image. For instance, it can identify the color of a car or suggest dinner ideas based on the contents of your fridge. However, if you present an image of a room and ask where the chair is, the model might not provide an accurate answer.

As you explore the applications of visual understanding, it's crucial to be aware of these limitations of the model.

Submitting Base64 Encoded Images

If you possess an image or a collection of images on your local device, you can transmit them to the model in a base64 encoded format. Here's an example demonstrating how this works:

import base64
import requests

# OpenAI API Key
api_key = "YOUR_OPENAI_API_KEY"

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())
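
The same base64 data URL also works with the official Python SDK used elsewhere in this tutorial. Here's a short sketch that reuses the base64_image string produced above:

from openai import OpenAI

# The SDK reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4-vision-preview",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          # Reuses the base64_image string from the example above.
          "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)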

Handling Several Image Inputs

The Chat Completions API can handle and process multiple image inputs, whether they are in base64 encoded format or provided as image URLs. The model will analyze each image and utilize the information gathered from all of them to formulate a response to the question.

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
  model="gpt-4-vision-preview",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What are in these images? Is there any difference between them?",
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)
print(response.choices[0])

In this scenario, the model is presented with two identical images and can respond to queries about both or each image separately.

Options for Image Detail Level

You can control the level of detail in the model's image processing and textual interpretation through the detail parameter, which offers three settings: low, high, or auto. By default, the model operates in auto mode, where it assesses the size of the input image to determine whether to apply low or high settings.

  • Low Detail: This setting deactivates the "high res" model. The model processes a low-resolution version of the image at 512px x 512px, representing it with a 65-token budget. This mode is suitable for faster responses and lower token consumption, ideal for scenarios where high detail is not necessary.
  • High Detail: Activating the "high res" mode allows the model to first view the low-resolution image and then generate detailed 512px square crops from the input image, based on its size. Each detailed crop utilizes double the token budget (130 tokens in total), providing a more detailed analysis of the image.

Here's an example that explicitly requests high detail:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4-vision-preview",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            "detail": "high"
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)
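
For comparison, here's a sketch of the same request with detail set to low, which skips the detailed crops and trades fidelity for speed and fewer tokens:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4-vision-preview",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            # "low" disables the high-res crops described above.
            "detail": "low"
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)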

Handling Images in the Chat Completions API

The Chat Completions API, unlike the Assistants API, does not maintain state. This means you need to manage the messages, including images, that you send to the model. If you wish to use the same image multiple times, you must include it with each API request.
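
To make that concrete, here's a minimal sketch of a two-turn conversation; the full message history, image included, is sent again on the second call:

from openai import OpenAI

client = OpenAI()

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

messages = [
  {
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": image_url}},
    ],
  }
]

first = client.chat.completions.create(
  model="gpt-4-vision-preview", messages=messages, max_tokens=300
)

# The API is stateless, so keep the assistant's reply and the original image
# in the history and resend everything along with the follow-up question.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": [{"type": "text", "text": "What season does it appear to be?"}]})

second = client.chat.completions.create(
  model="gpt-4-vision-preview", messages=messages, max_tokens=300
)

print(second.choices[0].message.content)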

For extended conversations, it's recommended to pass images by URL rather than base64. To improve the model's response time, downsize images before submission so they are below the maximum expected dimensions. For low-resolution mode, a 512px x 512px image is expected. For high-resolution mode, the shorter side of the image should be under 768px and the longer side under 2,000px.
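
One way to do that downsizing is with the Pillow library (an assumption on my part; it is not part of the OpenAI SDK). The sketch below scales an image to fit the high-resolution limits before base64 encoding it:

import base64
from io import BytesIO

from PIL import Image  # requires: pip install Pillow

def resize_and_encode(image_path, short_side=768, long_side=2000):
  # Scale the image so its shorter side is at most `short_side` pixels and
  # its longer side is at most `long_side` pixels, then base64 encode it.
  with Image.open(image_path) as img:
    scale = min(short_side / min(img.size), long_side / max(img.size), 1.0)
    if scale < 1.0:
      img = img.resize((int(img.width * scale), int(img.height * scale)))
    buffer = BytesIO()
    img.convert("RGB").save(buffer, format="JPEG")
  return base64.b64encode(buffer.getvalue()).decode("utf-8")

# "path_to_your_image.jpg" is the same placeholder path used earlier.
base64_image = resize_and_encode("path_to_your_image.jpg")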

Once processed, images are deleted from OpenAI's servers and are not used for model training.

Understanding the Model's Limitations

While GPT-4 with vision is versatile, it's important to recognize its limitations:
  • Medical Images: The model is not designed for interpreting specialized medical images like CT scans and should not be used for medical advice.
  • Non-English Text: Performance may decline with non-Latin alphabets, such as Japanese or Korean.
  • Text Size: Enlarging text in images can aid readability, but avoid cutting off crucial details.
  • Image Rotation: Rotated or upside-down text and images might be misinterpreted.
  • Visual Elements: Understanding graphs or text with varying colors or styles (like solid, dashed, or dotted lines) can be challenging.
  • Spatial Reasoning: The model has difficulty with tasks requiring precise spatial localization, like identifying chess positions.
  • Accuracy: There may be inaccuracies in descriptions or captions under certain conditions.
  • Image Shape: Panoramic and fisheye images pose challenges.
  • Metadata and Resizing: Original file names and metadata are not processed, and resizing affects the original dimensions of images.
  • Counting: The model may only provide approximate counts for objects in images.
  • CAPTCHAs: For safety, the submission of CAPTCHAs is blocked.

In Conclusion

We've covered what OpenAI Vision is and how to get started with it. GPT-4 with vision has many use cases and real-life applications that can help people in diverse ways. I hope you enjoyed this tutorial; stay tuned for more exciting content.

A quick note: I'm working on a future tutorial in which I'll build an AI-powered "eye" for people who are visually impaired, so stay tuned.