As a preface to this experiment, I’d like to note that the results of my testing here are in no way meant to be a conclusive or in-depth analysis of each model’s performance – rather, these experiments are just a tiny glimpse into how each model approaches solving a fairly simple web design problem. Great. Now that that’s out of the way, let’s talk about the project.
Where I’m Coming From
Last night, I found myself tossing and turning at 3AM, thinking about some upcoming work I’m doing on an intranet site for one of my clients. I suppose this is the price I pay for lulling myself to sleep by designing JavaScript functions, but it got me thinking about the performance of various “AI” tools when it comes to design. Like many other developers, I’ve integrated some of these tools (GitHub Copilot, CodeLLM, Cursor, etc.) into my day-to-day work, but I usually limit their contributions to a sort of glorified auto-complete. It’s difficult to deny that, at this point, these tools have become particularly good at understanding short snippets of code and iterating on common functions and algorithms.
With that said, I’ve always been cautious of passing off design work to AI – generally, I’ve found their work to be a bit sloppy when it comes to setting up any sort of design or UI elements, probably because they’re completely incapable of “seeing” in the same way we humans do. However, a new set of releases from Google and Anthropic – Gemini 2.5 Pro and Claude 4, respectively – begs me to revisit the question: can AI produce a competitive, modern design?
The Experiment
To answer this question, I decided to boot up all of the biggest and brightest large language models on the market, handing each of them the exact same prompt (shown below). I’m in need of a small, wholly contained date picker input for a classic ASP (I know, please hold your vomit) project I’ve been making minor tweaks on – luckily, this seems like the perfect place to start. Each model will spit out whatever JavaScript, CSS, and HTML is necessary to craft my date picker. I’ll exclusively use natural language to fix any immediate bugs, and I’m limiting each AI to only 3 “fixes” past the original design. Once we’ve gone through and analyzed each result, we’ll score them on functionality, design, code complexity, and expandability.
Create a fully functional, sleek, and modern datepicker input component using only HTML, CSS, and vanilla JavaScript (no external libraries).
The datepicker should:
- Use a clean, minimal UI design with smooth transitions and responsive layout
- Allow selection of dates from a popup calendar UI
- Highlight the current day and selected day clearly
- Support navigating between months and years
- Close the calendar when clicking outside or selecting a date
- Be keyboard-accessible and mobile-friendly
- Include all necessary styles and scripts in the same example, and ensure it's ready to copy and paste into a standalone HTML file.
The Results
GPT 4.1 (OpenAI)
See the Pen GPT4.1 DatePicker (RonanArmstrong.com) by Ronan Armstrong (@RonanArmstrong) on CodePen.
My initial impressions of GPT 4.1’s creation are extremely positive. It checks a lot of my metaphorical boxes, including keyboard accessibility, a clean animation on the popup, CSS variables for styling preferences, and media queries for sizing on mobile or smaller screens. Since GPT is the most popular AI tool on the market, we’ll use this as the “gold standard” against which every other model is judged.
The main issue I have with this implementation is the JavaScript code – in addition to a bug where the selected date would always be one day earlier than the date clicked, the JS is not particularly easy to read, nor is it performant. It uses convoluted variable names, and doesn’t natively support multiple DatePickers on a single page.
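I didn’t chase the bug all the way to its root, but the classic source of this exact off-by-one is JavaScript’s date-string parsing. A hedged sketch of the likely culprit:

```js
// new Date("YYYY-MM-DD") parses the string as UTC midnight, so in any
// timezone west of UTC the local representation falls on the previous day.
const parsed = new Date("2025-06-01");
console.log(parsed.toDateString()); // "Sat May 31 2025" in, e.g., New York

// Constructing from numeric parts uses local time instead (months are 0-based).
const local = new Date(2025, 5, 1);
console.log(local.toDateString()); // "Sun Jun 01 2025" everywhere
```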
| Functionality | ★★★★★ |
| Design | ★★★★★ |
| Complexity | ★★★★ |
| Expandability | ★★ |
Claude Sonnet 4 (Anthropic)
See the Pen Sonnet 4 DatePicker (RonanArmstrong.com) by Ronan Armstrong (@RonanArmstrong) on CodePen.
Anthropic is the de facto “king” of coding models, so I had high hopes for its new release. I can’t say it fully lived up to them, but nonetheless we’re left with a functional, flexible datepicker that I daresay reminds me a bit of MudBlazor’s aesthetic choices.
Let’s start with the good – once again, we have a clean, modern design with elegant animations; the datepicker worked perfectly on the initial render (other than a font color issue, which was easily resolved); and I’m a big fan of the use of classes to encapsulate the logic.
Unfortunately, Claude seemed to only deliver on the initial requirements, failing to go above and beyond in the same way that GPT 4.1 did. Accessibility is poor, as dates can’t be navigated with the arrow keys, and the customization options out of the box are slim. Regardless, a strong JavaScript implementation puts this right up there with GPT 4.1.
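For reference, the kind of arrow-key support I was hoping for boils down to a single keydown handler on the calendar grid. A minimal sketch under my own assumptions – the markup and names here are hypothetical, not Claude’s actual output:

```js
// Assumes each day is rendered as a <button data-day> inside .calendar-grid,
// with a roving tabindex so only one cell is focusable at a time.
const grid = document.querySelector(".calendar-grid");

grid.addEventListener("keydown", (e) => {
  const offsets = { ArrowLeft: -1, ArrowRight: 1, ArrowUp: -7, ArrowDown: 7 };
  if (!(e.key in offsets)) return;
  e.preventDefault();

  const cells = [...grid.querySelectorAll("button[data-day]")];
  const next = cells.indexOf(document.activeElement) + offsets[e.key];
  if (next < 0 || next >= cells.length) return; // month rollover omitted for brevity

  cells[next].focus(); // move keyboard focus to the new day
});
```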
| Functionality | ★★★★ |
| Design | ★★★★★ |
| Complexity | ★★★★ |
| Expandability | ★★★ |
Gemini 2.5 Pro (Google)
See the Pen Gemini 2.5 Pro DatePicker (RonanArmstrong.com) by Ronan Armstrong (@RonanArmstrong) on CodePen.
Google’s most recent LLM has caused quite a stir, thanks to its ridiculously long context window and quick response times, but seeing as how neither of those is on the docket for evaluation today, I was curious to see how it would stack up against the others.
Initial impressions place it somewhere in the same ballpark as GPT 4.1 – it shares the same use of CSS variables, as well as a relatively nice design (though I’d rank it lower than the other two we’ve tested so far in this department). Its main limitations are a painful reliance on document.getElementById, which significantly hinders its expandability, and its initialization logic. Both GPT 4.1 and Gemini fail to encapsulate the DatePicker logic, making it difficult to support multiple instances on a single page.
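To make that complaint concrete, here’s a rough sketch of the encapsulation both models skipped – scoping every DOM lookup to a root element rather than the document. The API and class names are my own illustration, not either model’s output:

```js
// Each instance owns its own root element, so several pickers can coexist.
class DatePicker {
  constructor(root) {
    this.root = root;
    this.input = root.querySelector(".dp-input");       // scoped lookups,
    this.calendar = root.querySelector(".dp-calendar"); // not getElementById
    this.input.addEventListener("focus", () => this.open());
  }
  open()  { this.calendar.classList.add("open"); }
  close() { this.calendar.classList.remove("open"); }
}

// Initialize every picker on the page independently.
document.querySelectorAll(".datepicker").forEach((el) => new DatePicker(el));
```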
| Functionality | ★★★★ |
| Design | ★★★★ |
| Complexity | ★★★ |
| Expandability | ★★ |
DeepSeek R1 (DeepSeek)
See the Pen Deepseek R1 DatePicker (RonanArmstrong.com) by Ronan Armstrong (@RonanArmstrong) on CodePen.
In the world of LLMs, few have caused as much of a stir as DeepSeek, predominantly due to its open-source nature. Unfortunately, it didn’t perform as well in this test as I’d have hoped, with the main issue stemming from its use of opacity: 0 and pointer events to control the component’s interactivity. As implemented, users can still interact with the dropdown when it isn’t open, leading to wildly unpredictable behavior. Additionally, accessibility was poor, with no way to open the dropdown via the keyboard.
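For anyone unfamiliar with the pattern, here’s a minimal sketch of the trap and its usual fix – the class names are mine, not DeepSeek’s:

```css
/* opacity: 0 hides the calendar visually, but it still receives clicks
   unless pointer events are disabled alongside it. */
.calendar {
  opacity: 0;
  pointer-events: none; /* the usual fix: ignore input while hidden */
  transition: opacity 0.2s ease;
}

.calendar.open {
  opacity: 1;
  pointer-events: auto; /* re-enable interaction only when visible */
}
```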
Outside of that, I have to praise DeepSeek for its elegant CSS, legible variable names, and class-based approach to JavaScript. If it weren’t for the accessibility and interactivity issues, this would be my top choice.
| Functionality | ★★ |
| Design | ★★★★★ |
| Complexity | ★★★★ |
| Expandability | ★★ |
Llama4 Maverick (Meta)
See the Pen Llama4 Maverick DatePicker (RonanArmstrong.com) by Ronan Armstrong (@RonanArmstrong) on CodePen.
Without getting too far into the weeds on software licensing, Llama4 is perhaps best known in the developer sphere for being an “open-source in name only” LLM. In addition to the controversy surrounding Llama’s licensing, its performance in our test indicates a much more… shall we say reserved approach to design.
All the accessibility, code complexity, expandability and getElementById issues that plagued earlier examples are present here, and the aesthetic design leaves quite a lot on the table. While I generally appreciate a restrained approach to UI design, this takes it a step too far, and I’d be extremely hesitant to even use this as a starting point.
| Functionality | ★★ |
| Design | ★ |
| Complexity | ★★ |
| Expandability | ★★ |
Grok 3 (xAI)
See the Pen Grok 3 DatePicker (RonanArmstrong.com) by Ronan Armstrong (@RonanArmstrong) on CodePen.
Speaking of controversy, I’ll stay out of the X/Twitter/Musk drama for now, focusing instead on Grok’s performance in this test, which was above average. Aesthetically, I’m happy with Grok’s results, and most of the functionality is there. My only gripes are the lack of accessible controls and CSS variables.
All the JavaScript functionality is encapsulated in a class (though I’m not a fan of the repeated querySelector calls), the datepicking functionality worked right out of the box, and the styles are inoffensive while staying modern.
| Functionality | ★★★ |
| Design | ★★★★★ |
| Complexity | ★★★★ |
| Expandability | ★★★ |
Qwen 235B A22B (Alibaba)
See the Pen Qwen 235B A22B DatePicker (RonanArmstrong.com) by Ronan Armstrong (@RonanArmstrong) on CodePen.
Until I wrote this article, Qwen was a bit of an unknown to me – while I’m extremely familiar with Alibaba’s B2B platforms, I’d left their tech side largely unexplored. A shame, as Qwen’s performance here puts it on par with some of the better models, at a much more accessible price.
It checks almost all the same boxes as Grok (with similar weaknesses in accessibility), while including CSS variables and a cleaner JavaScript codebase. Barring improvements from DeepSeek, I’d place this as one of the best price-to-performance options on the market.
| Functionality | ★★★ |
| Design | ★★★★★ |
| Complexity | ★★★★ |
| Expandability | ★★★★ |
o4 Mini High (OpenAI)
See the Pen o4 Mini High DatePicker (RonanArmstrong.com) by Ronan Armstrong (@RonanArmstrong) on CodePen.
Since I’ve heard a lot about o4, my expectations going in were high. To some degree, it delivered, offering accessible controls, animations, and a relatively modern aesthetic, but when compared to GPT 4.1, I found it a bit lacking. It shares a lot of the issues present in its (older?) sibling, including poor JavaScript expandability.
I was a little unimpressed with the design, which wasn’t as elegant as some of our other contenders, and the lack of class-based encapsulation in the JS knocked it down my rankings to just below Grok and Gemini.
| Functionality | ★★★★ |
| Design | ★★★★ |
| Complexity | ★★★ |
| Expandability | ★★ |
Conclusion
If we look at our rankings here, a few things stand out. Both GPT 4.1 and Sonnet excelled at creating modern UI components, but between the two I have to lean towards Sonnet, due to its better expandability. To me, adding a more robust accessibility solution to Sonnet would be significantly easier than rewriting GPT 4.1’s JavaScript to be more expandable.
Llama4 pulls a very distant last place, and the remainder of the models fall somewhere in the middle. If you’re planning on using these LLMs through an API connection, I would lean towards Qwen, due to its price-to-performance ratio. It’s close to the two top dogs while coming in at a fraction of the price per million tokens.
While this test was by no means an objective measure of each LLM’s performance, I found it fun to iterate through the various options on the market to see how each one responded to a pretty simple prompt. These days, it feels like we’re seeing a new LLM hit the market every other day, and with these results, I’m confident that we’ll see these models reach a point where they’re able to crank out accessible, maintainable code at an outstanding rate. The developer side of me is certainly pessimistic about how this will impact my career (aren’t we all a little pessimistic these days?), but I can’t say I’m not at least a little bit curious about what the next big revelation will be.
Until then,
Ronan