OPENAI

ChatGPT's application capabilities in the visual domain - Advanced Level 3

October 13, 2023

Scene text, table, chart and document reasoning

: Accurately identifies handwritten and printed text within a scene.
: Can identify a right triangle and determine its side lengths AB as 4 units and BC as 3 units.
: Accurately interprets the beginning and end of a proposal process.
: Identifies the Chinese dish "Re Gan Mian" and associates it with Wuhan City.

To improve model performance, advanced prompt techniques such as step-by-step guidance or few-shot context methods can be considered instead of providing the model with multi-page prompts all at once.

Multilingual and multimodal

: In image description prompts, accepts Chinese, French, and Czech languages and returns corresponding image descriptions in those languages.
: Recognizes scene images containing texts in multiple languages.
: Understands cultural differences and generates appropriate multilingual descriptions for wedding pictures.

Code generation capability

: Generates LaTeX code from handwritten math equations.
: Converts tables in images into Markdown code.
: Demonstrates how to replicate input graphics using Python, TikZ, and SVG.

Time and video understanding

: Can accurately analyze sequences of video frames.
: Measures the model's ability to identify causal relationships and temporal progressions in shuffled images.
: For example, use GPT-4V to predict short-term outcomes of soccer penalty kicks.
: Determines whether the goalkeeper successfully blocked the ball, demonstrating an understanding of causality.

Emotional intelligence testing

: Identifies emotions in facial expressions and provides reasonable emotional explanations.
: Interprets emotions based on content and image style, such as contentment, anger, awe, and fear. This is crucial for applications like home robots.
: Can describe images according to emotional requirements, making image descriptions scarier or more comforting.

Emergence

: Identifies differing regions or components in images.
: Demonstrates GPT-4V's defect detection capabilities on defective product images.
: Combines human detectors with GPT-4V's visual reasoning to identify potential safety hazards.
: Through further development, more complex and practical self-checkout scenarios can be explored, achieving full automation of the checkout process and enhancing customer experience.

ABOUT THE AUTHOR

Renee's Entrepreneurial JourneyEssay Editor

This is my little corner of the internet where I share thoughts, ideas, and interesting stuff I come across in the world of AI. Things in this field move fast, and I use this space to slow down a bit—to reflect, explore, and hopefully spark some good conversations.

Cognosys - AI Automated Workflow

May 20, 2024

OPENAI

ChatGPT multimodal trial

October 10, 2023

PROMPT

Six key points for writing effective Prompts

December 18, 2023

Alibaba's semantic recognition model SenseVoice and voice generation model CosyVoice

July 19, 2024

ENTREPRENEURSHIP

Suitability is more important than being good

June 4, 2022

LLM

AlphaGo and the Power of Reinforcement Learning - Andrej Karpathy's Deep Dive on LLMs (Part 9)

LLMMarch 21, 2025

Reinforcement Learning from Human Feedback (RLHF) - Andrej Karpathy's Deep Dive on LLMs (Part 10)

LLMMarch 22, 2025

The Future of Large Language Models - Andrej Karpathy's In-Depth Explanation of LLM (Part 11)

LLMMarch 23, 2025

GOOGLE

Trial of Google's video generation model VOE2

GOOGLEMarch 23, 2025

Gemini 2.5 Pro, claimed to be far ahead of the competition, has been released with great fanfare: comprehensively surpassing other LLMs and topping the global rankings

GOOGLEMarch 26, 2025

AI-Researcher: LLM-driven全自动 scientific research assistant

GOOGLEMarch 30, 2025

OPENAI

ChatGPT's application capabilities in the visual domain - Advanced Level 3

October 13, 2023

Scene text, table, chart and document reasoning

: Accurately identifies handwritten and printed text within a scene.
: Can identify a right triangle and determine its side lengths AB as 4 units and BC as 3 units.
: Accurately interprets the beginning and end of a proposal process.
: Identifies the Chinese dish "Re Gan Mian" and associates it with Wuhan City.

To improve model performance, advanced prompt techniques such as step-by-step guidance or few-shot context methods can be considered instead of providing the model with multi-page prompts all at once.

Multilingual and multimodal

: In image description prompts, accepts Chinese, French, and Czech languages and returns corresponding image descriptions in those languages.
: Recognizes scene images containing texts in multiple languages.
: Understands cultural differences and generates appropriate multilingual descriptions for wedding pictures.

Code generation capability

: Generates LaTeX code from handwritten math equations.
: Converts tables in images into Markdown code.
: Demonstrates how to replicate input graphics using Python, TikZ, and SVG.

Time and video understanding

: Can accurately analyze sequences of video frames.
: Measures the model's ability to identify causal relationships and temporal progressions in shuffled images.
: For example, use GPT-4V to predict short-term outcomes of soccer penalty kicks.
: Determines whether the goalkeeper successfully blocked the ball, demonstrating an understanding of causality.

Emotional intelligence testing

: Identifies emotions in facial expressions and provides reasonable emotional explanations.
: Interprets emotions based on content and image style, such as contentment, anger, awe, and fear. This is crucial for applications like home robots.
: Can describe images according to emotional requirements, making image descriptions scarier or more comforting.

Emergence

: Identifies differing regions or components in images.
: Demonstrates GPT-4V's defect detection capabilities on defective product images.
: Combines human detectors with GPT-4V's visual reasoning to identify potential safety hazards.
: Through further development, more complex and practical self-checkout scenarios can be explored, achieving full automation of the checkout process and enhancing customer experience.

ABOUT THE AUTHOR

Renee's Entrepreneurial Journey

Essay Editor

LLM

AlphaGo and the Power of Reinforcement Learning - Andrej Karpathy's Deep Dive on LLMs (Part 9)

LLMMarch 21, 2025

Reinforcement Learning from Human Feedback (RLHF) - Andrej Karpathy's Deep Dive on LLMs (Part 10)

LLMMarch 22, 2025

The Future of Large Language Models - Andrej Karpathy's In-Depth Explanation of LLM (Part 11)

LLMMarch 23, 2025

GOOGLE

Trial of Google's video generation model VOE2

GOOGLEMarch 23, 2025

Gemini 2.5 Pro, claimed to be far ahead of the competition, has been released with great fanfare: comprehensively surpassing other LLMs and topping the global rankings

GOOGLEMarch 26, 2025

AI-Researcher: LLM-driven全自动 scientific research assistant

GOOGLEMarch 30, 2025

ChatGPT's application capabilities in the visual domain - Advanced Level 3

Scene text, table, chart and document reasoning

Multilingual and multimodal

Code generation capability

Time and video understanding

Emotional intelligence testing

Emergence

ABOUT THE AUTHOR

RELATED

Cognosys - AI Automated Workflow

ChatGPT multimodal trial

Six key points for writing effective Prompts

Alibaba's semantic recognition model SenseVoice and voice generation model CosyVoice

Suitability is more important than being good

POPULAR

LLM

GOOGLE

ChatGPT's application capabilities in the visual domain - Advanced Level 3

Scene text, table, chart and document reasoning

Multilingual and multimodal

Code generation capability

Time and video understanding

Emotional intelligence testing

Emergence

ABOUT THE AUTHOR

POPULAR

AI TOOLS

RELATED

Cognosys - AI Automated Workflow

ChatGPT multimodal trial

Six key points for writing effective Prompts

Alibaba's semantic recognition model SenseVoice and voice generation model CosyVoice

LLM

GOOGLE