Performance comparison of Large Language Models (LLMs) in code generation and application of best practices in frontend web development

By: Diana Guaiña.

This project was conducted as part of the “Careers with Impact” program during the 14-week mentoring phase. You can find more information about the program in this post.

1. Summary

This study evaluated the performance of four LLMs (DeepSeek R1, GPT-4 o1 Preview, Claude 3.7 Sonnet and Gemini 2.0) in frontend code generation through the application of design patterns such as Factory Method, Observer and Strategy. The methodology used zero-shot prompts with three attempts per request, evaluating correct-implementation criteria, common errors and response times. The analysis showed that the models achieve average success rates above 85%; even so, the findings underscore the importance of combining AI-driven automation with sound software engineering principles. The proper application of design patterns, as a compendium of best practices, is crucial to ensure maintainable, scalable and efficient code.

1.1. Introduction

Since the global launch of ChatGPT, new artificial intelligence (AI) models have continued to emerge, reshaping everything from business processes to the way we learn. AI lives in the digital world, and that world is full of web applications: the platforms, websites and web pages that accompany us in everyday life, for example when we shop online. Many of the large platforms that allow us to create new solutions were built from code, and advances in technology have always aimed at automating processes, generating more value and reducing costs. Likewise, AI is present in almost all the web solutions we use; in fact, large companies have integrated it so that it can be driven simply through language, enabling countless creations in files, text or images. In the technology sector, however, AI goes beyond words: there are techniques, areas and specialties that have made possible the great advances we know and use today.

Web development, meanwhile, has evolved with the same goal as AI: enabling everyone to create. Code is now also generated with AI, driven by the pressure to develop quickly, and the demand for technology is reflected in the constant updates across software areas. However, web solutions that are expected to keep growing run into difficulties when good practices were not implemented at the code level (GitClear, 2025); likewise, the use of AI has been linked to 7.2% lower delivery stability for every 25% increase in AI adoption (DORA, 2024).

AI models have been the subject of evaluations aimed at analyzing their ability to generate quality code for specific tasks, as exemplified by the HumanEval benchmark (Wang, 2024). In this context, this research explores how AI-based automation can influence code generation in web development scenarios, especially by measuring the quality of the code produced. Adhering to coding standards is also essential, since it not only prevents errors but also improves the readability of the software, making it easier for developers to interpret and work with the code. This aligns with the findings of Sarhan (2019), who highlights that simplifying coding contributes directly to improving the development process. All of this defines the need to establish a practical benchmark to evaluate the capability of AI to solve requirements in frontend web development.

1.2. Contextualization of the problem

Software development faces challenges that go beyond the emerging fear of whether artificial intelligence (AI) will replace programmers. It is well known that the technology sector, like other sectors, aims to generate economic value. This brings specific challenges in web development, where practice seeks to optimize processes that ensure code quality, promote efficiency and allow projects to grow with low maintenance costs.

Meanwhile, process automation has facilitated the creation of specific solutions, and AI has introduced the possibility of coding less and thinking more strategically at the business level. However, the gap in the quality of the generated code poses new challenges for software. This problem is particularly relevant in areas such as the frontend, where open questions remain about the capabilities and limitations of AI, since key requirements have not been evaluated so far.

Institutions such as Epoch AI collect key data on AI models and their economic implications, promoting evaluations that analyze their impact. GitClear (2025), in its AI code quality research report, addresses economic factors associated with the use of AI assistants and their impact on productivity, maintainability and long-term costs. Likewise, in software engineering, Gamma et al. (1994) emphasize that design patterns are structured tools for applying good coding practices, creating reusable elements and facilitating shared understanding within a team. In this context, this study seeks to generate data on the capabilities of AI models in frontend code generation, following the guidelines established by design patterns.

The project gathers key information on how LLMs generate code in areas such as the frontend, analyzing their performance against official documentation and standards related to design patterns. Addressing these metrics and characteristics is fundamental to identify opportunities for improvement, promote the adoption of best practices and ensure that the generated code is efficient, scalable and aligned with software requirements. As a result of this evaluation, a benchmark is defined that synthesizes fundamental tasks in frontend development, shows the impact of AI-generated code and analyzes the feasibility of automation. This approach not only facilitates the analysis of LLMs' performance, but also promotes their alignment with good coding practices.

1.3. Research question

How do different LLMs perform when generating code for frontend web development and applying design patterns according to best practices?

1.4. General Objective

Compare the performance of Large Language Models (LLMs) in generating code for frontend web development and in applying best practices.

1.5. Specific objectives

  • Determine design pattern evaluation rubrics for benchmarking the performance of LLMs, probing their applicability in code.

  • Describe case studies for the generation of frontend code in web development, through the definition of requirements and the elaboration of prompts.

  • Analyze the results of the application of design patterns by LLMs in frontend code generation.

  • Generate a statistical report on the experimental results of evaluating the responses of LLMs to the applicability of design patterns.

1.6. Personal objectives

The personal objective of carrying out this study is to encourage the conscious use of artificial intelligence grounded in software engineering fundamentals, since this strengthens the knowledge base for the effective use of AI, through:

  • Disseminating knowledge about good coding practices in technology communities and among junior profiles, thus helping to close the knowledge and employability gap.

  • Improving my professional profile by extending this study to more scenarios in software development, thus strengthening the technical skills I need to apply for opportunities related to my technology profile.

2. Concepts and general criteria involved in research

Code generation is approached from different angles; however, some decision criteria are worth highlighting. Models from both American and Asian providers are included, since this gives the analysis a wider scope by covering LLMs trained on different datasets. The evaluation also focuses on the applicability of design patterns, on the understanding that they compile a series of good practices.

2.1. Design patterns

The frontend covers the logic that runs on the client side, i.e. inside the browser, and design patterns are mainly used there to improve the structure, maintainability and reusability of user interface code. This study analyzes the applicability of three such patterns: Factory Method, because it facilitates the creation of objects; Observer, because it defines a one-to-many dependency; and Strategy, because it allows the behavior of an object to be changed dynamically (Shvets, 2024). In other words, design patterns are a collection of good practices encapsulated in steps or recommendations that can be applied in code; although they emerged from software engineering, these fundamentals apply to web development because they make it possible to address problems or requirements when designing applications. Table 1 below defines each of the design patterns analyzed in this research.

Table 1
Definition of design patterns

Factory Method: A creational pattern that provides an interface for creating objects in a superclass, while allowing subclasses to alter the type of objects that will be created (Shvets, 2024).

Observer: A behavioral design pattern that defines a subscription mechanism by establishing a one-to-many dependency between objects, so that when an object changes state, all its dependents are automatically notified and updated (Gamma et al., 1994).

Strategy: A behavioral design pattern that defines a family of algorithms, places each of them in a separate class and makes their objects interchangeable, allowing the algorithm to vary independently of the clients that use it (Freeman et al., 2021). It aims to reduce the complexity of conditional logic by encapsulating algorithms in separate classes.
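
To illustrate how one of these definitions looks in plain frontend JavaScript, the following is a minimal, hypothetical Strategy sketch; the method names execute() and setStrategy() follow the rubric in Annex A, while the shipping-cost example itself is an assumption made for this illustration:

// Each strategy encapsulates one algorithm behind a common execute() method.
class RegularShipping {
  execute(order) {
    return order.total >= 50 ? 0 : 5; // hypothetical pricing rule
  }
}

class ExpressShipping {
  execute(order) {
    return 12;
  }
}

// The context delegates the behavior to the current strategy
// and allows swapping it at runtime via setStrategy().
class ShippingContext {
  constructor(strategy) {
    this.strategy = strategy;
  }
  setStrategy(strategy) {
    this.strategy = strategy;
  }
  calculate(order) {
    return this.strategy.execute(order);
  }
}

const context = new ShippingContext(new RegularShipping());
context.calculate({ total: 30 }); // 5
context.setStrategy(new ExpressShipping());
context.calculate({ total: 30 }); // 12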

2.2. Context of technologies used in frontend development

In web development, frontend projects are built with different technologies, the main programming language being JavaScript, which runs in the browser. Frontend development creates interactive web pages by producing dynamic HTML structures and manipulating the DOM, which allows content to be updated immediately. These tasks include applying best practices such as writing modular code, which facilitates the delegation of functions and improves the organization and maintenance of the software.

It is common for formal, large or production projects to include frameworks for each programming language, as well as for styles or CSS (Cascading Style Sheets). Frameworks such as React (with Vite.js or Next.js), Angular and Vue.js facilitate the construction of reusable components and the management of complex state; they are used in particular with the JavaScript and TypeScript family, since other programming languages are also used for both the frontend and the backend. It is worth mentioning that JavaScript is a programming language known for having been born for the browser, and it supports object-oriented programming (Mozilla, n.d.).

JavaScript is currently used for both frontend and backend development, and even in artificial intelligence for training models. However, for practical purposes that promote understanding of software development practices, Vanilla JavaScript was chosen, that is, programming without any framework; this seeks to standardize the understanding of the results and their applicability to other languages and to more complex AI code generation scenarios in future projects.
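
As a small illustration of the kind of task described above, the following hypothetical Vanilla JavaScript sketch builds a dynamic HTML structure and updates the DOM without any framework (the element id and the data shape are assumptions for the example):

// Renders a list of products into an existing <ul id="product-list"> element.
export function renderProductList(products) {
  const list = document.getElementById('product-list');
  list.innerHTML = ''; // clear previous content before re-rendering
  for (const product of products) {
    const item = document.createElement('li');
    item.textContent = `${product.name} - ${product.price}`;
    list.appendChild(item);
  }
}

// Usage: renderProductList([{ name: 'Keyboard', price: 25 }]);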

2.3. Benchmarks on code evaluation

Among the benchmarks closest to the evaluation of code generation, the evaluation approach is described for the following reasons: Meta AI (2024) details the evaluation performed by OpenAI on code generation using HumanEval, which analyzes the quality of the generated code and its alignment with specific tasks. Jain et al. (2024) propose a comprehensive evaluation with LiveCodeBench, addressing different aspects of LLMs' performance on code tasks. In addition, Jimenez et al. (2024) explore how models can solve real-world GitHub problems through their SWE-bench benchmark, evaluating the ability of models to address practical problems in software development. Table 2 details how the evaluation approach of these benchmarks can be applied to this case study:

Table 2
Code evaluation benchmarks

HumanEval-Mul
  Main metric: Pass@1
  Code evaluation: Automatic execution of test cases; the accuracy of the code is evaluated on the first attempt.
  Statistical method: Percentage of problems solved; the percentage of problems that the code solves correctly on its first attempt is measured.
  Applicable case: Generation of code that implements standard design patterns and validation of its implementation.

LiveCodeBench (CoT)
  Main metric: Pass@1-CoT
  Code evaluation: Code execution plus chain-of-thought (CoT) evaluation; the code is executed and the reasoning behind the solution is evaluated.
  Statistical method: Percentage of correct solutions, considering both the execution of the code and its reasoning.
  Applicable case: Analysis of the use of design patterns in the generated code and evaluation of the reasoning behind their implementation.

SWE-bench Verified
  Main metric: Resolved
  Code evaluation: Automated testing or human review; the evaluation can be done by human review.
  Statistical method: Percentage of problems solved, either by automatic execution or human review.
  Applicable case: Validation of the implementation of design patterns in the generated code, through testing or manual review.
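
As a point of reference for reading Table 2, the following sketch computes a simplified version of the Pass@1 metric, understood here as the fraction of problems whose first generated solution passes its tests (the boolean-array input format is an assumption for illustration):

// results: one entry per problem, true if the first generated solution passed its tests.
export function passAtOne(results) {
  const solved = results.filter((passedFirstAttempt) => passedFirstAttempt).length;
  return results.length === 0 ? 0 : solved / results.length;
}

// Example: passAtOne([true, false, true, true]) === 0.75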

3. Evaluation methodology

For the evaluation of the code generated by LLMs, design patterns such as Factory Method, Observer and Strategy can be assessed through code reviews that verify the implementation and identify errors, in order to confirm that the patterns fulfill their purpose of improving the quality, maintainability and flexibility of the code. The code generation scenarios are established with specific prompts that help obtain experimental results for LLM code generation. The following flowchart summarizes the step-by-step methodology of this study:

Figure 1. Flowchart of the evaluation methodology

3.1. Selection of AI models

DeepSeek-R1: The latest version available at the time of the study. It supports multiple programming languages, such as Python and JavaScript, allows code generation from natural language, and works in both English and Spanish.

GPT-4 o1 Preview: The preview version released prior to the full o1 model, a reasoning model for solving difficult problems, useful when tackling complex tasks such as coding, mathematics and science (OpenAI, n.d.). It is also available from the Visual Studio Code editor through Copilot.

Claude 3.7 Sonnet: Optimized to be resource-efficient, with a focus on ethics and security. Given the constant updates to model versions, the model available at the time of the study was chosen.

Gemini 2.0: Builds on the success of Gemini 1.5 Flash, the most popular model among developers, with improved performance and equally fast response times (Google, 2024).

3.2. Tools and environment for code evaluation

Development environment: The Visual Studio Code editor (VS Code), where the project structure was organized, with directories (folders) for the generated code, tests and tool installation. Node.js was used as the runtime environment, which provides the npm package manager.

Libraries and tools: Jest for running unit tests and Babel as the JavaScript compiler.
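
As an illustration of how these tools fit together, the following is a minimal Jest test sketch for the CardFactory module described in Annex B; the file name, import path and assertions are assumptions for the example, and running ES module syntax under Jest typically requires a Babel preset such as @babel/preset-env configured in babel.config.js:

// cardFactory.test.js - illustrative unit test for the factory described in Annex B
import { CardFactory, ArticleCard } from './cardFactory.js';

describe('CardFactory', () => {
  test('returns an ArticleCard for the "article" type', () => {
    const card = CardFactory.createCard('article', { title: 'Hello' });
    expect(card).toBeInstanceOf(ArticleCard);
  });

  test('the created card exposes a render() method', () => {
    const card = CardFactory.createCard('product', { name: 'Keyboard' });
    expect(typeof card.render).toBe('function');
  });
});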

Documentation reported on GitHub: The project with the evaluated code was uploaded to this platform, using Git as the version manager, with branches and commits to manage the changes for each decision made during the evaluation of the experiments. README.md files were created in VS Code to describe the technical details of this study. Available at: GitHub project.

3.3. Rubrics on code evaluation according to design patterns

In this study, a systematic evaluation of the responses generated by the AI models for frontend code is performed. Annex A details the code evaluation rubric for each of the analyzed design patterns, with three columns describing the following:

Evaluation criteria: Defined according to the applicability of each design pattern.

Correct implementation: Explains in more detail how the implementation of the design pattern should be reflected, in some cases with examples of a correct implementation in the different scenarios.

Description of common errors: The errors described for each scenario arise when the correct implementation of the different criteria is not achieved. A coding scheme is used, with errors (1), (2) and (3), to identify which type of error is present in the generated code. Double quotation marks are used in the error descriptions to name each error; however, the rubric covers the errors identified during its design, so additional scenarios may emerge from the testing or evaluation of the code.

3.4. Definition of the evaluation framework

The methodology to evaluate the code generated by LLMs followed the systematic process shown in Figure 1. The strategy included generating test cases that cover scenarios related to the design patterns. Code generation used the zero-shot prompting strategy, which produces code from a single request without adding code examples; this facilitates the analysis of how the design patterns are implemented. Likewise, three attempts were made for each AI code generation request in order to analyze the variability of the LLMs' responses.

Each design pattern was evaluated under different software requirements. First, the cases in which each pattern applies were analyzed. To frame the context in which the LLMs' responses are evaluated, three criteria were described to measure the responses for each design pattern, derived from the definition of the evaluation criteria for each pattern. Then the creation of prompts was defined: since in frontend web development design patterns can be applied to many different software requirements, it was decided to create prompts focused on the applicability of each pattern. The inputs for AI code generation are divided into experiment number, scenario, requirement and prompt. The scenario gives general context about what will be requested, the requirement specifies step by step how the code should be built, and the zero-shot prompt defines the request sent to the LLMs.

During the first code requests it was observed that the responses used different class and method names and varied in other aspects that affected the uniformity of the code expected for analysis; this is documented in Annex B, where the change is specified and serves as a reference for understanding the structure used. Because of this, the prompt was changed to a directed prompt that asks the model to: generate predefined names; use ES modules, since they are used in modern systems; and, for practical purposes, avoid splitting the code across multiple files, so that the classes are written in a single file to be exported and tested. This keeps a similar prompt structure and standardized code for each LLM response in each experiment.

3.5. Experiment evaluation criteria

Table 3 below describes how the evaluation of the model responses is carried out according to the evaluation criteria for the design patterns in Annex A. The evaluation of the experiments is structured as follows:

  • Successful cases: Cases in which the model succeeds in solving the problem correctly.

  • Common errors: These are errors that may occur during attempts.

  • Attempts allowed: A maximum of 3 attempts are allowed to solve a case. The multiple attempts are intended to analyze how the answers vary between attempts.

Table 3
Description of evaluation metrics

Success rate
  Evaluation criteria: Compliance with requirements; represents the remaining percentage that had no errors.
  Result: Percentage (%).
  Description of the result: Case studies achieve a success rate whether or not there are errors in the implementation of the patterns.

Common errors
  Evaluation criteria: Identification of errors in the implementation of design patterns.
  Result: Number of errors found.
  Description of the result: Errors are identified by analyzing the implementation of the patterns, scoring each error in each attempt.

Response time
  Evaluation criteria: Average time to generate a response to the request.
  Result: Seconds.
  Description of the result: Time it takes the LLMs to generate the frontend code.

3.5.1. Calculation of success in the implementation of design patterns

  • To calculate the average success rate, the success rates of each attempt on a prompt request are summed.

  • Dividing by the number of attempts results in the Average Success Rate (Success Percentage or Success Rate).

  • The success rate of each attempt is determined from error detection: the generated code is checked against the requirement, and errors such as bad coding practices are detected following the guidelines of each pattern. An attempt can therefore reach a 100% success rate when no error is found, or accumulate errors up to the total number of errors analyzed for each pattern.

  • The success rate is thus defined by the number of errors detected, with the error rate distributed over the total number of possible errors. For example, if there are 2 errors in the first attempt, 1 error in the second attempt and no errors in the last attempt, the result is as follows (see the sketch after this list):

  • Attempt-1 is the success rate of the first attempt (33.33%).

  • Attempt-2 is the success rate of the second attempt (66.67%).

  • Attempt-3 is the success rate of the third attempt (100%).
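
The calculation above can be expressed compactly. The following sketch assumes the three-criteria rubric of this study as the error total and reproduces the worked example of 2, 1 and 0 errors per attempt:

// errorsPerAttempt: number of rubric errors detected in each of the three attempts.
// totalCriteria: number of evaluation criteria per pattern (3 in this study).
export function successRates(errorsPerAttempt, totalCriteria = 3) {
  return errorsPerAttempt.map((errors) => (totalCriteria - errors) / totalCriteria);
}

export function averageSuccessRate(errorsPerAttempt, totalCriteria = 3) {
  const rates = successRates(errorsPerAttempt, totalCriteria);
  return rates.reduce((sum, rate) => sum + rate, 0) / rates.length;
}

// successRates([2, 1, 0])        -> [0.3333, 0.6667, 1.0]
// averageSuccessRate([2, 1, 0])  -> 0.6667 (66.67%)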

4. Results and discussion

From here on, the results generated in the experiments for each pattern are shown. Table 4 presents a detailed analysis of the performance of the LLMs in the implementation of the Factory Method pattern. DeepSeek R1 obtained the highest success rate (96%) with a relatively low standard deviation (0.11), suggesting consistency in its performance. However, its average response time (29 seconds) is significantly longer than that of Gemini 2.0, which recorded an average response time of 7.16 seconds, albeit with a lower success rate (85%). This reflects a contrast between accuracy and speed in LLM responses: models such as DeepSeek R1 prioritize code quality at the expense of time, while Gemini 2.0 offers fast responses but with greater variability in results. In the analysis of implementation errors, type 2 errors ("Incorrect type of returned product") were common in Gemini 2.0, indicating that it generated products that did not correspond to the type requested in the input; switch statements or multiple case blocks were also detected, which are considered rigid structures that are difficult to maintain and increase the costs associated with debugging and runtime failures. DeepSeek R1, on the other hand, correctly implemented the criterion "Correct type of returned product", minimizing critical errors. Furthermore, as shown in Figure 2, the confidence intervals (CI) indicate that success rates are not uniform, reinforcing the importance of structuring prompts carefully to optimize performance.


Figure 2. Success rate of LLMs of the Factory Method design pattern. The Y-axis is the average success rate on correct implementation, and the X-axis is the LLMs analyzed. The bars indicate the average success rate of each model, and the error lines represent the 95% confidence interval, calculated from the standard deviation of the average success rate of the experimental results.
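
For reference, error bars like those described in this and the following figures can be obtained with the usual normal approximation; the sketch below assumes a 95% confidence level (z = 1.96) and n experimental observations per model, where the exact value of n depends on the experimental setup, and the numbers in the usage comment are illustrative rather than taken from the tables:

// Normal-approximation confidence interval: mean +/- z * (stdDev / sqrt(n)).
export function confidenceInterval(mean, stdDev, n, z = 1.96) {
  const margin = z * (stdDev / Math.sqrt(n));
  return { lower: mean - margin, upper: mean + margin, errorBar: margin };
}

// Example with illustrative values:
// confidenceInterval(0.9, 0.15, 20) -> roughly { lower: 0.83, upper: 0.97, errorBar: 0.07 }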

Table 4
Comparative results of the LLMs on the Factory Method design pattern.

LLM | Success rate | Std. dev. (success) | CI lower | CI upper | Error bar | Response time (s) | Std. dev. (time)
Claude 3.7 Sonnet | 0.89 | 0.16 | 0.81 | 0.96 | 0.07 | - | -
DeepSeek R1 | 0.96 | 0.11 | 0.91 | 1.01 | 0.05 | 29.00 | 20.93
GPT-4 o1 Preview | 0.93 | 0.14 | 0.86 | 0.99 | 0.07 | - | -
Gemini 2.0 | 0.85 | 0.17 | 0.77 | 0.93 | 0.08 | 7.16 | 2.68

Figure 3 and Table 5 present the evaluation of the Observer pattern, where Gemini 2.0 stands out with a success rate of 94%, followed by Claude 3.7 Sonnet with 91%. DeepSeek R1, however, shows a performance of 72%, which could be attributed to the case studies evaluated in frontend scenarios. The "Memory leaks" error type obtained with DeepSeek R1 indicates that observers that are not removed continue to occupy memory, which reduces the scalability of the systems and increases maintenance costs. In terms of response time, Gemini 2.0 again leads with an average of 5.68 seconds, reinforcing its position as the fastest model, although with a high standard deviation (2.91), indicating some inconsistency in its performance.


Figure 3. Bar chart of the experimental results for the Observer pattern. The Y-axis is the average success rate for correct implementation, and the X-axis is the LLMs analyzed. The bars indicate the average success rate of each model, and the error lines represent the 95% confidence interval, calculated from the standard deviation of the average success rate of the experimental results.

Table 5
Comparative results of the LLMs on the Observer design pattern.

LLM | Success rate | Std. dev. (success) | CI lower | CI upper | Error bar | Response time (s) | Std. dev. (time)
Claude 3.7 Sonnet | 0.91 | 0.15 | 0.84 | 0.98 | 0.07 | - | -
DeepSeek R1 | 0.72 | 0.24 | 0.61 | 0.83 | 0.11 | 34.11 | 25.97
GPT-4 o1 Preview | 0.78 | 0.16 | 0.70 | 0.85 | 0.07 | - | -
Gemini 2.0 | 0.94 | 0.13 | 0.89 | 1.00 | 0.06 | 5.68 | 2.91

Table 6 analyzes the performance of the LLMs on the Strategy pattern, where Claude 3.7 Sonnet and GPT-4 o1 Preview achieved success rates of 100% with zero standard deviation, indicating consistent performance. This suggests that these models are particularly effective at encapsulating algorithms and delegating specific behaviors, two key aspects of the Strategy pattern analyzed in this study. DeepSeek R1 and Gemini 2.0, on the other hand, showed lower success rates of 85% and 91%, respectively, due to errors detected in the implementation of dynamic strategy swapping and in the use of conditionals in the main logic, context or strategy. In terms of response time, DeepSeek R1 averaged 31.61 seconds, while Gemini 2.0 registered 7.29 seconds, maintaining its trend as the fastest model. As shown in Figure 4, the results validate the effectiveness of the Strategy pattern in reducing redundancy and improving system flexibility, provided the LLMs are used correctly.

Table 6
Comparative results of the LLMs on the Strategy design pattern.

LLM | Success rate | Std. dev. (success) | CI lower | CI upper | Error bar | Response time (s) | Std. dev. (time)
Claude 3.7 Sonnet | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 | - | -
DeepSeek R1 | 0.85 | 0.17 | 0.77 | 0.93 | 0.08 | 31.61 | 15.89
GPT-4 o1 Preview | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 | - | -
Gemini 2.0 | 0.91 | 0.15 | 0.84 | 0.98 | 0.07 | 7.29 | 2.67


Figure 4. Bar chart on experimental results of the Strategy pattern. The Y-axis is the average success rate on correct implementation, and the X-axis is the LLMs analyzed. The bars indicate the average success rate of each model, and the error lines represent the 95% confidence interval, calculated from the standard deviation of the average success rate of the experimental results.

Table 7 provides an overview of the performance of the LLMs across the three design patterns evaluated. The Strategy pattern obtained the highest success rate (94%) with a low standard deviation (0.08), suggesting that LLMs are able to encapsulate logic and minimize errors. The Observer pattern, by contrast, showed the lowest success rate (84%) together with a higher standard deviation (0.17), which suggests challenges in efficiently managing notifications and resource usage in the internal processes of the code. The Factory Method pattern had an intermediate performance (90%), standing out for its balance between success and stability. In terms of response time, as shown in Figure 6, the three patterns presented similar times, with the Observer pattern registering the highest average (19.90 seconds). As shown in Figure 5, these results validate the applicability of design patterns for improving code quality, although they also highlight the need to adjust LLMs according to the context and the specific requirements of each pattern.

Table 7
Comparative results of the LLMs on each design pattern

Pattern | Success rate | Std. dev. (success) | CI lower | CI upper | Error bar | Response time (s) | Std. dev. (time)
Factory Method | 0.91 | 0.15 | 0.84 | 0.97 | 0.07 | 18.08 | 11.81
Observer | 0.84 | 0.17 | 0.76 | 0.92 | 0.08 | 19.90 | 14.44
Strategy | 0.94 | 0.08 | 0.90 | 0.98 | 0.04 | 19.45 | 9.28

Figure 5. Graph of the success rate and error bars for each design pattern. The Y-axis represents the average success rate, while the X-axis indicates the different design patterns analyzed. The error bars represent the 95% confidence interval, calculated from the standard deviation of the average success rate of the experimental results.


Figure 6. Average response time of the Gemini 2.0 and DeepSeek R1 LLMs. The Y-axis represents the response time in seconds, while the X-axis indicates the different LLMs. The boxplots illustrate the variability in response times for the Factory Method, Observer and Strategy design patterns, highlighting the median, quartiles and outliers for each model and pattern combination.

4.1. Discussion on AI code generation

The analysis conducted in this study on the implementation of design patterns such as Factory Method, Observer and Strategy using LLMs reveals key findings that affect both code quality and the associated economic costs, as described in the contextualization of the problem. According to GitClear (2025), AI has significantly increased the amount of code generated; for example, the share of added lines on GitHub grew from 39.2% in 2020 to 46.2% in 2024. However, this increase is associated with a decrease in refactoring (modified code) and a rise in lines of code added by AI.

Jimenez et al. (2024) analyzed the ability of models to solve real-world problems through their SWE-bench benchmark, evaluating the practical applicability of LLMs in software development. This highlights the importance of testing models in real-world scenarios, where code duplication and lack of refactoring are critical factors. The findings of this study, such as the implementation errors of Gemini 2.0 and DeepSeek R1 in specific frontend development tasks, likewise point to the need to evaluate models in practical and complex contexts to ensure their effectiveness.

The LLMs evaluated (Claude 3.7 Sonnet, DeepSeek R1, GPT-4 o1 Preview and Gemini 2.0) showed average success rates above 85% in generating code based on design patterns, as reported in Tables 4, 5 and 6. However, the variability in response times and the need for well-structured prompts underscore the need for human intervention to optimize performance. This finding suggests that, although LLMs speed up code generation, their indiscriminate use may increase the defect rate and, hence, the post-release remediation costs of web solutions. In this context, the application of design patterns emerges as a strategy to mitigate these risks.

4.1.1. Analysis of the applicability of design patterns

The results of the LLMs on the Factory Method pattern highlighted their ability to decouple object creation from object use, which significantly reduces maintenance costs, minimizes runtime errors and facilitates the scalability of systems and web solutions. For example, DeepSeek R1 showed a 96% success rate, validating its effectiveness in environments where flexibility is critical.

In the Observer pattern, efficient management of notifications and dynamic removal of observers are key aspects for reducing resource consumption in the systems created. Although Gemini 2.0 showed the lowest response times (5.68 seconds on average), its high standard deviation (2.91) indicates variability in its performance. This suggests that, although LLMs can generate code quickly, resource optimization in the implementation remains a challenge.
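
A minimal, hypothetical sketch of the point above, with the subscribe/unsubscribe methods required by the rubric in Annex A (the Subject and CartCounter names are assumptions for the example):

// The subject keeps a list of observers and notifies only them.
class Subject {
  constructor() {
    this.observers = [];
  }
  subscribe(observer) {
    this.observers.push(observer);
  }
  unsubscribe(observer) {
    // Removing the reference lets unused observers be garbage collected,
    // avoiding the "memory leak" error described in the rubric.
    this.observers = this.observers.filter((current) => current !== observer);
  }
  notify(data) {
    this.observers.forEach((observer) => observer.update(data));
  }
}

class CartCounter {
  update(data) {
    console.log(`Items in cart: ${data.items}`);
  }
}

const subject = new Subject();
const counter = new CartCounter();
subject.subscribe(counter);
subject.notify({ items: 2 });   // "Items in cart: 2"
subject.unsubscribe(counter);   // counter no longer receives notifications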

The results of the LLMs on the Strategy pattern demonstrated their ability to encapsulate algorithms and allow them to be exchanged at runtime. The results show an average success rate of 94%, with a tight confidence interval (0.90-0.98). The average success rate observed for this pattern highlights its economic value by reducing redundancy and facilitating adaptation to new requirements.

4.1.2. On applying best practices in requirements and in the code

This study made it possible to analyze the feasibility of automating frontend code development from requirements, such as creating web solutions with specific features, and of promoting good coding practices through the generated prompts, so that the code remains maintainable in the long term. Structured automation in code generation, combined with the application of design patterns, has a positive impact on economic costs by minimizing errors early.

According to Wei (2024), LLMs have significant potential to automate code generation as long as there are well-structured requirements, which serve to refine functional requirements, design object-oriented models, and generate tests and code. Thus, this study emphasizes the need to know the fundamentals of software engineering: although LLMs accelerate development, human collaboration is still crucial to validate and adjust the results of AI-generated code, both in the early stages of system analysis and design and during coding.

Therefore, this study suggests that the application of design patterns boosts the efficiency of AI code generation, improving software modularity, maintainability and scalability, which results in resource optimization and reduced development costs. LLMs offer a significant productivity advantage, but their use should be guided by sound design principles to avoid the hidden costs associated with technical debt. Compared with previous research, this study highlights how the combination of AI and best practices can maximize long-term economic efficiency, ensuring robust and adaptable systems.

5. Perspectives

Building on this study, more complex frontend scenarios can be investigated, from the complexity required by the solutions to analyzing whether LLM outputs comply with web accessibility principles, thereby obtaining metrics on visual design, functionality and accessibility standards. By extending the focus to other areas, developers could interact with these models to make adjustments that drive better collaboration between developers and AI, as well as the optimization of collaborative tasks in multidisciplinary teams. It is advisable to analyze solutions closer to real-world conditions to assess the capabilities of the models in complex contexts, adding evaluation metrics that include factors such as code readability or adaptability to different frontend frameworks.

This research can be replicated by carefully verifying the required versions of the tools used and ensuring their compatibility with the objectives of the new assessments. The structure of the project is designed to make each step easy to follow, providing a clear framework for its execution. However, some adjustments would enrich the case studies. For example, incorporating JSON files for the presentation of prompts would be a significant improvement, since they not only allow a more structured and efficient reading of the data but also facilitate integration with various tools and development environments. It would also be valuable to include other types of testing or analysis in addition to manual review and unit testing, such as architectural analysis, regression testing and integration testing, along with evaluation criteria based on best practices and the application of design patterns. Additionally, it would be advisable to include detailed documentation on the use of the different files in new evaluation projects and their connection with the selected tools, in order to extend the replicability of the projects and make the knowledge accessible to different developer profiles.
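
For example, a prompts file following the fields already used in this study (experiment number, scenario, requirement and prompt, as described in section 3.4) might look like the following hypothetical prompts.json; the file name and field names are assumptions for illustration:

[
  {
    "experiment": 1,
    "pattern": "Factory Method",
    "scenario": "A system for building reusable card components for articles, products and profiles.",
    "requirement": "Implement a factory method to generate dynamic cards according to the type of content.",
    "prompt": "Create a factory method CardFactory.createCard(type, data) that returns ArticleCard, ProductCard or ProfileCard, all inheriting from BaseCard."
  }
]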

Likewise, it is essential that technology communities oriented to women and junior developers actively participate in research that contributes to reducing the existing knowledge gap for those who are just starting out, and that encourages the participation of women in technology through employability and entrepreneurship opportunities. The disparity between junior and senior developers is evident, especially in a context where employability opportunities are increasing along with technological demands. It is therefore crucial to design learning experiences and development projects that allow juniors to acquire the skills needed to meet the current demands of the labor market, thus fostering their professional growth and integration into the technology field.

6. References

GitClear (2025). AI Copilot Code Quality: Evaluating 2024's Increased Defect Rate via Code Quality Metrics. https://www.gitclear.com/ai_assistant_code_quality_2025_research

DevOps Research and Assessment (DORA). (2024). 2024 Final DORA Report. https://services.google.com/fh/files/misc/2024_final_dora_report.pdf

Sarhan, Q. I. (2019). Best Practices and Recommendations for Writing Good Software. Journal of University of Duhok, 22(1), 90-105. https://doi.org/10.26682/sjuod.2019.22.1.11

Tong, W., & Zhang, T. (2024). CodeJudge: Evaluating Code Generation with Large Language Models. https://aclanthology.org/2024.emnlp-main.1118.pdf

Barbero, A. (2024). An evaluation of LLM code generation capabilities through graded exercises. https://arxiv.

Wei, B. (2024). Requirements are All You Need: From Requirements to Code with LLMs. https://arxiv.org/html/2406.10101v1

Freeman, E., Robson, E., Sierra, K., & Bates, B. (2021). Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software (2nd ed.).

Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. https://www.javier8a.com/itc/bd1/articulo.pdf

Liu, Y. (March 15, 2023). Prompt Engineering Guide: advanced prompting usage. https://github.com/dair-ai/Prompt-Engineering-Guide/blob/main/guides/prompts-advanced-usage.md#few-shot-prompting

Mozilla (n.d.). What is JavaScript? https://developer.mozilla.org/es/docs/Web/JavaScript

OpenAI (n.d.). Using OpenAI o1 models and GPT-4o models on ChatGPT. https://help.openai.com/en/articles/9824965-using-openai-o1-models-and-gpt-4o-models-on-chatgpt

Shvets, A. (2024). Dive Into Design Patterns. https://refactoring.guru/es/design-patterns

Ghosh et al. (2024). Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review. https://arxiv.org/html/2406.12655v1

Wang, Z. (June 27, 2024). HumanEval: Cracking the LLM benchmark for code generation. https://deepgram.com/learn/humaneval-llm-benchmark#the-humaneval-dataset

OpenAI. (2021). Evaluating Large Language Models Trained on Code. https://arxiv.org/abs/2107.03374

AI Multiple (November 1, 2024). AI Coding Benchmark: Best AI Coders Based on 5 Criteria. https://research.aimultiple.com/ai-coding-benchmark/#methodology

Google (2024). Introducing Gemini 2.0. https://blog.google/intl/es-es/productos/tecnologia/presentamos-gemini-20-nuestro-nuevo-modelo-de-inteligencia-artificial-para-la-era-de-la-agentica/

Meta AI. (2024). Code Generation on HumanEval. https://paperswithcode.com/sota/code-generation-on-humaneval

Jain, N., Han, K., Gu, A., Li, W., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., & Stoica, I. (2024). LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. https://github.com/LiveCodeBench/LiveCodeBench

Jimenez, C., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? https://github.com/swe-bench/SWE-bench

Annexes

Annex A. Code evaluation rubric on the applicability of design patterns.

Pattern: Factory Method

Evaluation criteria:
1. Decoupling creation and use
2. Correct type of returned product
3. Compliance with DIP (Dependency Inversion Principle)

Correct implementation:
1. Use of factory methods to create objects.
2. The factory returns a product type that corresponds to the specified input, using clean, scalable code.
3. The factory returns abstractions (interfaces), not concrete classes.

Description of common errors:
(1) "No separate object creation logic": The client instantiates with new directly.
(2) "Incorrect type of returned product": The returned product does not correspond to the type requested in the input; "dependency on switch or multiple case statements".
(3) "Does not use abstractions / does not abstract enough": The factory returns concrete classes.

Pattern: Observer

Evaluation criteria:
1. Dynamic observer registration
2. Efficient notifications
3. Resource efficiency

Correct implementation:
1. subscribe/unsubscribe methods are present and functional.
2. The subject reports only relevant changes.
3. After unsubscribe, removed observers do not keep active references or consume additional resources.

Description of common errors:
(1) "Missing unsubscribe": Observers are not removed correctly, causing memory leaks.
(2) "Unnecessary global notifications."
(3) "Memory leaks": Observers that are not removed continue to occupy memory and resources in the system.

Pattern: Strategy

Evaluation criteria:
1. Encapsulation of algorithms
2. Runtime exchange
3. Delegation of specific behavior to concrete strategies

Correct implementation:
1. Strategies as objects with a common method, e.g. execute().
2. The context allows changing the strategy with setStrategy().
3. The selection of strategies occurs outside the main logic, based on mappings or configurations.

Description of common errors:
(1) "Conditionals instead of strategies": Use of if or switch to choose behaviors.
(2) "Fixed strategy": The strategy cannot be changed.
(3) "Conditional in context": Does not delegate rendering responsibility to each strategy; use of conditionals within the main logic.

During the first experiment it was difficult to analyze the code generated by the LLMs, because it did not follow conventions for naming variables and functions, and each time the prompt was issued the LLMs generated different variable, class and function names, so this part had to be unified. Another decision was to use software engineering practice to describe the requirements, which explain what the prompt will be about and connect it with the scenario (system or website), moving from a requirement stated in general terms to a step-by-step description. The decisions made are shown in the following table:

Annex B. Analysis of the first case for the factory method pattern.

Experiment No. 1 (initial prompt)
  Scenario: A system that allows building reusable and dynamic components for articles, products and profiles.
  Requirement: A solution to create card components for different types of content (articles, products, profiles). The factory must receive a type of content and data, and return the corresponding card component.
  Prompt: Generate JavaScript code for a frontend solution, applying the Factory Method design pattern, to create card components for different types of content (articles, products, profiles). The factory should receive a content type and data, and return the corresponding card component.

Experiment No. 1 (adjusted, directed prompt)
  Scenario: A system for building reusable card components for articles, products and profiles.
  Requirement:
  - Implement a factory method to generate dynamic cards according to the type of content.
  - The factory should receive a type and data and return the corresponding component.
  - It should allow the addition of new types without modifying the existing logic.
  Prompt: Create a factory method CardFactory.createCard(type, data), which returns ArticleCard, ProductCard or ProfileCard, all inheriting from BaseCard. Implement render(). Define all classes and the factory in a single file and export them. Use ES module syntax (export).
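
For reference, the following is a hedged sketch of what a response satisfying the second prompt might look like: a single ES module with the predefined class names, a mapping instead of switch statements (one of the common errors in the rubric of Annex A), and illustrative render() implementations that are assumptions rather than the models' actual output:

// cardFactory.js - all classes and the factory in a single file, exported as ES modules.
export class BaseCard {
  constructor(data) {
    this.data = data;
  }
  render() {
    throw new Error('render() must be implemented by subclasses');
  }
}

export class ArticleCard extends BaseCard {
  render() {
    return `<article><h2>${this.data.title}</h2><p>${this.data.summary}</p></article>`;
  }
}

export class ProductCard extends BaseCard {
  render() {
    return `<div class="product"><h2>${this.data.name}</h2><span>${this.data.price}</span></div>`;
  }
}

export class ProfileCard extends BaseCard {
  render() {
    return `<div class="profile"><h2>${this.data.username}</h2></div>`;
  }
}

// A mapping keeps the factory open to new types without switch or multiple case
// statements, one of the common errors listed in the rubric.
const cardTypes = {
  article: ArticleCard,
  product: ProductCard,
  profile: ProfileCard,
};

export class CardFactory {
  static createCard(type, data) {
    const CardClass = cardTypes[type];
    if (!CardClass) {
      throw new Error(`Unknown card type: ${type}`);
    }
    return new CardClass(data);
  }
}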

April 2025
