Header image: the text "gpt-4o" with the "4o" painted over and "o1" written in.

September 13, 2024

Hands-on with OpenAI’s New “Reasoning” Model, GPT o1

In theory gpt-o1 can explore options for solving problems, refine them, and identify mistakes in its own "thinking."

By Daniel Detlaf

One-man flea circus, writer, sci-fi nerd, news junkie and AI tinkerer.


The rumored “Strawberry” model from OpenAI has been released in beta for ChatGPT Plus and Tier 5 API customers. The new line is called “o1”, a new type of AI model trained for multi-step reasoning. According to the release announcement, it is intended to tackle hard problems in science and coding.

The o1 series (o1-preview and o1-mini) takes a new approach to AI reasoning. These models are trained to spend more time processing information before responding, in a way OpenAI compares to how a person works through a problem. In theory, o1 can explore options for solving problems, refine them, and identify mistakes in its own “thinking.” The intended result is better problem solving.

Is GPT o1 smart?

In benchmark tests, the upcoming o1 model looks pretty spiffy. Don’t they always, though?

Allegedly, it performs on par with Ph.D. students on challenging benchmark tasks in physics, chemistry and biology, and it shows big improvements in math. For example, on a qualifying exam for the International Mathematics Olympiad, the o1 model correctly solved 83% of problems, compared to GPT-4o’s 13%. Its coding abilities have also been evaluated in contests, where it reached the 89th percentile on Codeforces.

The model has been trained on an all-new dataset tailored to the multi-step approach. More importantly, it has been extensively trained using large-scale reinforcement learning (RL).

Tantalizingly, one of the researchers suggested that the gains from this reinforcement learning seem to scale differently than the (rapidly diminishing) improvements from model pre-training. In other words, maybe they can keep improving it through RL indefinitely.

There is also a new safety approach that supposedly makes o1 more resistant to jailbreaking attempts. The o1-preview model scored 84 out of 100 on one of OpenAI’s toughest jailbreaking tests, compared to GPT-4o’s score of 22.

What’s the ChatGPT o1 preview like?

If you’ve used Advanced Data Analysis (formerly Code Interpreter) in ChatGPT, you’ll be familiar with the little dropdown boxes that show the analysis steps the model went through and any code it wrote. The o1 preview works similarly, showing a summary of its chain of thought beneath a note saying how long it spent thinking.

It is moderately agentic. If you have tried AutoGPT or similar agentic frameworks, it’s a bit like that, but cleaner.

There are new “reasoning tokens” in the API documentation. These are tokens the model generates internally while it works through each step of its reasoning. They are not returned in the response, but they take up room in the context window and are billed as output tokens. That means implementation in the API could be a little complicated, and costly to boot: GPT-4o is about four times cheaper than o1.
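Because reasoning tokens are billed but never shown, a request can cost much more than its visible answer suggests. The following is a minimal Python sketch of estimating that cost, assuming the usage object has the shape the Chat Completions API returns (`prompt_tokens`, `completion_tokens`, and `completion_tokens_details.reasoning_tokens`, with reasoning tokens already counted inside `completion_tokens`). The names `estimate_cost` and `CostEstimate` are hypothetical helpers, and the per-million-token prices and token counts in the example are placeholders, not quoted rates or real measurements.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    visible_output_tokens: int  # output tokens that actually appear in the reply
    reasoning_tokens: int       # hidden tokens spent "thinking"
    total_usd: float            # estimated cost of the whole request


def estimate_cost(usage: dict, input_price_per_m: float,
                  output_price_per_m: float) -> CostEstimate:
    """Estimate one request's cost from a Chat Completions-style usage dict.

    Assumption: reasoning tokens are included in completion_tokens and
    billed at the output rate, so they are not added a second time.
    """
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]  # includes hidden reasoning tokens
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    cost = (prompt * input_price_per_m + completion * output_price_per_m) / 1_000_000
    return CostEstimate(completion - reasoning, reasoning, cost)


if __name__ == "__main__":
    # Made-up usage numbers and placeholder prices, for illustration only.
    sample = {
        "prompt_tokens": 500,
        "completion_tokens": 2000,
        "completion_tokens_details": {"reasoning_tokens": 1500},
    }
    print(estimate_cost(sample, input_price_per_m=15.0, output_price_per_m=60.0))
```

In this made-up example, 1,500 of the 2,000 output tokens are hidden reasoning, so three quarters of the output bill pays for text the user never sees.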

Reasoning Test

I was asking ChatGPT to help me identify demographic correlations to advertise an affiliate product, Prodentim. [If you purchase through that link, we will get a commission. Thanks for your support!] The ads aren’t intended for The Servitor, but if you happen to be a developer with dental problems…

I have a test account at Rumble Advertising Center where I’m running some small-scale pay-per-click advertising experiments. Any sort of supplement or nutritional product is expensive to advertise, because the markets are highly competitive. I was asking GPT to help me find less expensive keyword categories that might cross over to the target demographic.

ChatGPT o1's reasoning when presented with my demographic correlation question:
Building the list

I'm piecing together a list of 10 indirect keyword categories for advertising Prodentim, focusing on the target demographic's characteristics and avoiding higher-cost sectors.

Identifying indirect keywords

First, I’m tasked with listing 10 indirect keyword categories for a middle-aged, conservative male audience, avoiding costly sectors. These categories should align with the demographic's interests and political leanings.

Generating appropriate categories

I'm working through correlated categories for ProDentim, focusing on characteristics like pro-GOP and MAGA supporters, while ensuring content aligns with ethical guidelines.

Identifying target demographics

I’m thinking through a middle-aged+ demographic, primarily male, American, conservative, pro-GOP, MAGA, Trump supporters. The focus is on dental health-related interests.

Navigating interests

I'm thinking through interests like DIY home improvement, classic cars, outdoor activities, financial planning, men's health, veterans' support, and potentially conservative news. Conservatism is nuanced, encompassing various ideologies and actions.

Enumerating categories

I’m working through indirect keyword categories, spanning varied interests like DIY projects, outdoor activities, classic cars, men's health, financial planning, veterans' support, barbecue, gardening, sports, and patriotic events. Each category offers unique keyword expansion potential.

Crafting keyword strategies

Mapping out 10 keyword categories that will effectively target the demographic based on their interests and hobbies, ensuring relevance and utility.

The ChatGPT o1 preview showing its reasoning in response to a question about keywords and demographic ad targeting. Screenshot: Daniel Detlaf

o1 lays out its thought process in a series of steps before providing its final output. When given a task to generate indirect keyword categories for targeting a specific demographic, the model broke its approach down into several stages.

The model first identified the key components of the keyword task, including the target demographic characteristics and the need for indirect, cost-effective keyword categories.

It then considered the given demographic: middle-aged, predominantly male, white American conservatives with strong GOP and MAGA affiliations. It touched on interests such as DIY home improvement, classic cars, outdoor activities, and financial planning.

Model output:
Certainly! Here are 10 categories/types of indirect keywords that could potentially expand your advertising reach to the target demographic:

Men's Health Supplements
DIY Home Improvement
Outdoor Activities (Hunting, Fishing)
Classic Cars and Automotive Restoration
Barbecue and Grilling Techniques
Financial Planning for Retirement
Patriotic Merchandise and Events
Golf Equipment and Tips
Veteran Support Communities
Gardening and Landscaping
These categories align with the interests and hobbies commonly associated with middle-aged, conservative American males and can help you reach your target audience indirectly without entering highly competitive or expensive advertising sectors.

The model’s final output. Screenshot: Daniel Detlaf

The model’s final output was a list of ten indirect keyword categories. These ranged from men’s health supplements (not very helpful) to patriotic merchandise and veteran support communities. It was… ok. Not mind-blowing, but not bad.

Coming Soon to a ChatGPT Near You

Access to these new models is being rolled out gradually. ChatGPT Plus and Team users can now access the o1 models, while API access is limited to developers in usage tier 5 (those who have spent at least $1,000 on the OpenAI API).

o1’s new approach offers a window into the AI’s decision-making process. It is a bit more transparent than previous models, showing the steps of its reasoning before arriving at the final answer.

It’s neat, but I have questions about what to do with this expanded thought process and whether it actually improves the quality of the output. It may be a model in search of a use case.
