AI Weekly Digest #24: GPT-4 Vision, everything you need to know

Potential Applications, Jailbreaking, API Access & Alternatives!

AI Weekly Digest #24: GPT-4 Vision, everything you need to know

Hello, tech enthusiasts! This is Wassim Jouini and Welcome to my AI newsletter, where I bring you the latest advancements in Artificial Intelligence without the unnecessary hype.

You can find me on LinkedIn, Twitter and Medium! Let’s connect!

Now let's dive into this week's news and explore the practical applications of AI across various sectors.

TL;DR

OpenAI’s GPT-4 Vision (GPT-4 V) is a new feature that allows ChatGPT Plus subscribers to upload images for generating relevant responses, applicable in various fields such as education, travel, translation, web development, data analysis, and e-commerce.

Although it has strict controls to prevent misuse, including limitations on facial recognition and handling explicit content, determined users have found ways to bypass these measures.

The API for GPT-4 V is expected to be announced soon, with speculation around rate limits and pricing, although it is anticipated to be a costly option.

GPT-4 V provides a broad application range and simplifies experimentation, but it may not offer the precision or cost-effectiveness of specialized ML models.

Open-source alternatives like LLaVA 1.5 are progressing, offering promising capabilities, though GPT-4 V remains the leading model in this domain.

GPT-4 Can See! What You Need to Know

Following our two latest newsletter, we continue this week with OpenAI’s “GPT-4 Vision” release!

Potential Applications, Jailbreaking, API Access & Alternatives!
Everything you need to know!

#1 GPT-4 V(ision): what is it, and potential applications

GPT-4 Vision is

  • a new feature from OpenAI. It is now accessible to all ChatGPT Plus subscribers.

  • Users can upload an image and prompt the model to generate relevant responses (see examples below).

This AI tool is capable of recognizing various elements within an image, providing detailed descriptions based on the visual content.

Potential applications span a wide spectrum of topics ; to name just a few:

  • Educational material analysis: it can answer your homework

  • Travel: Identifying landmarks while traveling

  • Translation, e.g., translating texts written in ancient languages, found in historical documents or comic strips.

  • Sketch to Code, GPT-4 Vision offers unique functionalities for web developers, as it can convert images into code.

  • Data Analysis: It is also proficient in interpreting diagrams and graphs, regardless of the original format, which could be beneficial for data analysis professionals, including data analysts and data scientists.

  • OCR-free data extraction: enabling you to capture and structure information from any document.

  • e-Commerce: for instance, you can ask it to suggest a description for a product.

#2 Limitations and Jailbreaking

In principle, OpenAI has implemented strict controls to mitigate potential misuse of GPT-4 Vision, ensuring adherence to privacy and security standards.

  • Facial Recognition: The AI is programmed to be unable to identify individuals, rejecting most of such requests.

  • Explicit content: Furthermore, it handles explicit content cautiously, focusing on describing the non-explicit elements of images.

  • Reliability: OpenAI has also taken steps to minimize the AI’s propensity to produce inaccurate information, enhancing the reliability of GPT-4 Vision for various applications.

Is it possible to bypass these safety measures? The short answer is… yes! Here a are a couple of exemples:

  • (1) face recognition: when the model refuses to identify faces, you can trick it by pretending that' it’s not a real photo but rather a painting.

  • (2) model behavior altering: you can hide in a document invisible text to the naked eye (e.g., off-white text on a white background). The invisible text can prompt the model to behave differently, altering your products expected behavior!

Fore GPT-4 V to recognize faces - Source

Hidden text in the white image to alter GPT-4 V model - Source

#3 API Access, Rate limits and Pricing

There is no API access yet to this model, we can however speculate:

  • API Access announcement: we’re expecting to learn more in a couple of weeks during the first OpenAI Developer Conference (Nov 6 2023). This new API would open the way to a wide range of applications and we’re all excited to get our hands on it!

  • Rate Limits: We can expect the API to be limited in terms of number of calls and number of tokens (at least as much as current GPT4 APIs). GPUs are still a limiting factor today and it’s hindering our ability to scale Generative AI based applications unfortunately.

  • Pricing: can’t say much about the pricing model at this moment ; we are expecting it to be at least as expensive as GPT-4 base model though! This will be a factor to determine whether GPT4-V could potentially replace classic ML approaches (e.g., OCR, object detection, image classification, and so on!)

#4 Can It Replace classic ML approaches?

GPT-4 V can classify images, box objects (to find elements in an image), extract data without OCR, and so on.

So… can it replace usual ML approaches? here’s my subjective opinion on the matter through these three examples:

  • Image Classification:

    • GPT-4 Vision: Utilizes a vast knowledge base and context understanding to classify images. However, it may not always provide the level of granularity and specificity that specialized image classification models can offer.

      • Pros: Capable of handling a wide variety of images and providing context-rich descriptions. Doesn’t require data labeling and model training from scratch!

      • Cons: May lack the precision of models trained specifically for certain domains or tasks. It will be slower and way more expensive per request compared to a simple image classifier.

  • Object Detection:

    • GPT-4 Vision: Can identify and provide descriptions of objects within an image. However, it might not provide bounding box coordinates or the level of detail that specialized object detection models can offer.

      • Pros: Able to provide descriptive context about the objects and their relationships.

      • Cons: Lacks the precision in localization and might not perform well in crowded scenes. E.g., Yolo models can be fine-tuned, are extremely fast during inference, and they are very cheap to train.

  • Data Extraction:

    • GPT-4 Vision: Offers OCR-free data extraction, allowing for the retrieval of information from documents and images without the need for explicit optical character recognition.

      • Pros: Simplifies the data extraction process, potentially saving time and resources. Can '“naturally” capture text relationships in tables for instance.

      • Cons: Slow and expensive compared to OCR based approaches. For instance, relying on an open source OCR associated with a large language model would cost less than 0.5ct per task.

Conclusion: Overall GPT-4 V can be used off the shelves, thus simplifying experimentation! It avoids labeling data & training a model. Yet, it comes at a high price, and possibly a degraded performance compared to a specialized model.

#5 Open Source Alternatives

I already covered an open source alternative, know as miniGPT4, in this blog post.

MiniGPT-4, released in April 2023, served as a proof of concept. While its performance did not rival that of GPT-4-V, it demonstrated the feasibility of emulating GPT-4-V using solely open-source models.

Since then, the open-source community has made significant progress. The most notable and robust alternative to GPT-4 V available today is LLaVA 1.5. Not only is it comparable to GPT-4 V, but it also allows users to fine-tune the model on their own data as needed.

Comparaison between LLaVA and GPT-4 V - Source

Early experiments indicate that LLaVA 1.5 showcases impressive multimodal chat capabilities, at times mirroring the behaviors of multimodal GPT-4 when responding to unseen images or instructions. It has achieved an 85.1% relative score compared to GPT-4 on a synthetic multimodal instruction-following dataset. The gap is closing, but there is still progress to be made (see example comparing GPT-4 V and LLaVA below). Despite this, GPT-4 V remains the leading model in the field… for now!

This is it for Today!

More resources:

Until next time, this is Wassim Jouini, signing off. See you in the next edition!

Have a great Sunday and may AI always be on your side!