This blog post is a summary of a video.

Comparing GPT-3 vs GPT-4 AI: Which Generates Better Code, Content and Reasoning?

Author: WordsAtScale
Time: 2024-02-04 18:05:01

Introduction to GPT-3 vs GPT-4

There has been a lot of buzz lately around the release of GPT-4 and how it compares to its predecessor GPT-3. In this blog post, we will analyze the differences between these two AI models across several tests including code generation, language ability, long-form content creation, and critical thinking evaluation.

We will specifically compare GPT-3, GPT-4 within ChatGPT, and Bing, which supposedly uses GPT-4 as well. By running through this series of tests, we aim to determine whether GPT-4 represents a significant leap over GPT-3, and how the free ChatGPT and Bing options stack up.

Overview of GPT-3 vs GPT-4

GPT-3 was released by OpenAI in 2020 as a large language model capable of understanding and generating human-like text. GPT-4, the latest iteration, was released in 2023 and is reportedly more capable across most language tasks. There is still limited public information about GPT-4, but it is expected to be better at tasks like coding, critical thinking, and long-form content generation. The key questions we want to analyze are:

  • How much better is GPT-4 at coding and math problems compared to GPT-3?
  • Can GPT-4 follow complex instructions like using only words starting with the letter 'A'?
  • Does GPT-4 write longer, more original articles?
  • How does GPT-4 perform at analyzing complex technical papers?

Comparing Code Generation Capabilities

One claimed improvement in GPT-4 is its ability to generate code, so we tested it alongside GPT-3 and Bing by asking each to create an HTML website. GPT-4 produced a visually appealing site with unique AI jokes, while GPT-3 had a plainer design and some duplicate jokes. Bing, surprisingly, generated broken code with artifacts.

HTML Code Examples and Results

When asked to generate HTML with inline CSS for a jokes website, GPT-3 produced a simplistic design with three jokes in total. GPT-4 created a visually superior site with five unique AI-related jokes. The Bing attempt, however, resulted in fragmented code that failed to render properly. This suggests GPT-4 is a real upgrade over GPT-3 for creative coding tasks, while Bing lacks the same robust generative capabilities despite supposedly running on GPT-4.
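For reference, a minimal page of the kind the models were asked to produce might look like the sketch below. The jokes, styling, and structure here are our own placeholders, not any model's actual output.

```html
<!-- Illustrative sketch of a jokes page with inline CSS; not model output. -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>AI Jokes</title>
</head>
<body style="font-family: sans-serif; max-width: 600px; margin: 2rem auto;">
  <h1 style="text-align: center;">AI Jokes</h1>
  <ul style="list-style: none; padding: 0;">
    <li style="padding: 0.5rem; border-bottom: 1px solid #ddd;">
      Why did the neural network go to therapy? It had too many hidden layers.
    </li>
    <li style="padding: 0.5rem; border-bottom: 1px solid #ddd;">
      Why was the robot angry? Someone kept pushing its buttons.
    </li>
  </ul>
</body>
</html>
```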

Code Quality Comparison

We also tested GPT-4's coding by having it build a small multiple-choice quiz, which involved generating JavaScript snippets for dynamically showing the quiz questions and options. GPT-4 produced working code for this interactive element within 15 minutes, demonstrating greater coding prowess than GPT-3 through faster generation of complex program logic. The output still required minor tweaking, but it provided an advanced starting point.
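The core of such a quiz can be sketched in a few lines. The questions, answers, and function names below are illustrative, not the model's actual output; DOM rendering is left out so the scoring logic stands on its own.

```javascript
// Each question stores its options and the index of the correct option.
const questions = [
  { text: "What year was GPT-3 released?", options: ["2018", "2020", "2023"], answer: 1 },
  { text: "Which company created GPT-4?", options: ["Google", "OpenAI", "Meta"], answer: 1 },
];

// Counts how many selected option indexes match the correct answers.
function scoreQuiz(questions, selections) {
  return questions.reduce(
    (score, q, i) => score + (selections[i] === q.answer ? 1 : 0),
    0
  );
}

console.log(scoreQuiz(questions, [1, 1])); // both answers correct
console.log(scoreQuiz(questions, [0, 1])); // one answer correct
```

A real version would render each question's options as buttons and pass the clicked indexes into the same scoring function.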

Language Ability Tests

Beyond coding, GPT models are judged based on core language understanding across areas like puzzles, mathematics, and story writing. Our tests reveal GPT-4 has achieved new heights of comprehension but Bing fails to match it despite touting GPT-4 integration.

Answering Riddles and Puzzles

We provided a sample riddle to see whether the AI models could determine the correct solution. GPT-3 and Bing both incorrectly guessed "horse", while the GPT-4 chatbot solved it properly, demonstrating logical reasoning that was unavailable before GPT-4.

Solving Math Problems

The math test involved calculating the total number of handshakes between people in a room, along with a complex coin puzzle. GPT-3 and Bing arrived at the wrong totals, but GPT-4 nailed the solutions and showed its work. This quantitative edge is important for real-world applications.
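The handshake problem has a closed-form answer: with n people, each pair shakes hands once, giving n(n-1)/2 handshakes. A quick check (the group size here is our own example, since the post doesn't state the exact prompt):

```javascript
// Handshakes among n people: each of the C(n, 2) pairs shakes once,
// so the total is n * (n - 1) / 2.
function totalHandshakes(n) {
  return (n * (n - 1)) / 2;
}

console.log(totalHandshakes(10)); // 45
console.log(totalHandshakes(2)); // 1
```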

Story Writing with Constraints

For creative writing, we asked the models to generate a short story about a French bulldog using only words starting with the letter 'A'. GPT-3 failed to follow these constraints, while the GPT-4 chatbot abided by the instructions. Once again, Bing could not match the advanced comprehension that comes with GPT-4 integration.
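A constraint like this is easy to verify mechanically. The checker below is our own addition for illustration, not something from the video:

```javascript
// Returns true only if every word in the text starts with 'a' or 'A'.
// Words are extracted as runs of letters/apostrophes, so punctuation
// like "Astonishing!" does not break the check.
function allWordsStartWithA(text) {
  const words = text.match(/[a-z']+/gi) || [];
  return words.length > 0 && words.every((w) => w[0].toLowerCase() === "a");
}

console.log(allWordsStartWithA("An astute animal ambled away.")); // true
console.log(allWordsStartWithA("A brave bulldog barked."));       // false
```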

Long-Form Content Generation Comparison

Given GPT-4's stronger coding and language abilities, does it also write longer, more original articles? Our 2,000-word article test reveals Bing as the surprise winner for quality factual writing, while GPT-4 offers superior formatting.

Article Length and Word Counts

When prompted to write a 2,000-word article on crypto mining at home, GPT-3 produced an 800-word piece, while Bing and GPT-4 each generated about 1,300 words. GPT-4 took longer to load and hit errors, so Bing was faster, but GPT-4 produced a better-formatted table of contents.

Originality Scores

We scanned all three articles for originality scores. GPT-3 scored just 2% original content. Bing scored a much higher 69%, likely due to citing outside references. GPT-4 landed at 51% originality with its own detailed writing. For long-form writing, businesses may prefer Bing for more factual articles with citations, while GPT-4 generates lengthier creative content like stories.

Critical Thinking Evaluation

The final test assessed how GPT-3, GPT-4, and Bing analyze a complex technical paper by identifying its strengths, limitations, and possible improvements.

Analyzing a Technical Paper

When provided with an academic introduction on image recognition, GPT-3 highlighted a few high-level pros and cons. GPT-4 delivered a more thorough breakdown, flagging issues such as lengthy paragraphs and technical jargon. Bing's analysis, meanwhile, overlapped significantly with GPT-3's. This evaluation showcases GPT-4's greater comprehension and critical thinking when assessing advanced writing and research material.

Conclusion and Key Takeaways

Through extensive testing, GPT-4 demonstrates clear improvements over GPT-3 in areas like code generation, math problems, constrained writing, and critical document analysis. The upgrades are noticeable not only in raw performance but also in the practicality of using GPT-4 for business use cases.

Bing showcases some complementary strengths around fast factual writing, citing sources automatically. However, it fails to match GPT-4 in domains requiring advanced reasoning; its integration likely uses a lower-powered version without the full capabilities.

In summary, GPT-4 represents a major leap for generative AI, establishing new benchmarks in comprehension that allow more impactful real-world applications across content, code, analytics and more.


Frequently Asked Questions

Q: What were the key differences found between GPT-3 and GPT-4?
A: GPT-4 showed clear improvements in code generation, math and reasoning problems, and constrained writing compared to GPT-3. GPT-3 struggled with tasks like writing a story where all words had to start with 'A'.

Q: Which model generated better long-form content?
A: GPT-4 and Bing (likely using GPT-4 tech) produced comparable long-form content, with Bing scoring higher on originality and GPT-4 offering better formatting.

Q: Did the tests show GPT-4 is a major leap forward?
A: Yes, the side-by-side comparisons revealed noticeable differences in GPT-4's abilities over GPT-3, particularly in math, reasoning, and constrained writing.