Github Copilot and Code Quality. How to lie with statistics
- Victor Hugo Germano
- Nov 29, 2024
- 5 min read

Before analyzing this article, let me state my bias: I have been exploring the use of AI tools for software development tasks for some time now, and my stance on the hype is quite clear, including commentary on the use of these tools and their apparent long-term impact. This post is a continuation of what I have already been writing.
GitHub Copilot is perhaps the leading code assistant today. The weight of Microsoft, with its peculiar product approach, has certainly amplified the hype around the tool, and if you are in a corporate context, someone has probably already knocked on your door offering AI as part of the Visual Studio package or Office 365.
At a rather steep price, around US$400 per developer per year, the least one can expect is that the company present consistent results and information about the effectiveness of the tool.
This is what the recent article "Does Github Copilot Increase Code Quality? This is What the Data Says" aims to demonstrate, with an analysis of the positive impact that using Copilot can have on your team's code quality.
The article is terrible, and the study is extremely biased and sloppy in its criteria. It reads as if it were written by an LLM with the prompt: "create a study on code quality that speaks well of Copilot". I expected more rigor from the company.
First rule for anyone in tech:
Companies do not disclose research with bad results about their products. There is no incentive to disclose information that could negatively impact the sales of their products.
Separating technology marketing from relevant information is an important skill in the world of tool sales.
The Study - Code Quality?
According to the article itself:
"Results in our latest study show that the quality of code written with Github Copilot is significantly more functional, readable, reliable, maintainable and concise"
This study was probably written for executives or technology leaders without technical knowledge, and it can easily convey the impression that we are facing a quantum leap in quality. At first glance expectations run high, and the promise of performance gains is what the market is forever chasing.
The results presented are interesting, but they do not hold up to even a minimally detailed analysis.
GitHub conducted an experiment with 243 developers, split into two groups: one using Copilot and one not. Key findings include:
- Developers using Copilot are 56% more likely to have all tests passing
- Using Copilot, you can add 13% more code before introducing any code smells
- Using Copilot, the code is 3.62% more readable, 2.94% more reliable, 2.47% more maintainable, and 4.16% more concise
- PRs from Copilot users are 5% more likely to be approved without comments

What my skeptical mind sees in these statements:
Having been in the market for so long, I can say with 99.99% certainty that these metrics mean absolutely nothing. Zero, niente, rien, 白 - and should not be used in this context!
Why does the graph above also have the percentages wrong? Was it Bing who made it? 62.2% + 39.2% = 100% ??!?! What do you mean?!
Furthermore, without considering the context and meaning, a 2.94% increase in reliability seems to me like a rounding error! This kind of percentage is paltry and meaningless: "I'm 2% fatter today, I had a Coke at lunch"
What is the technical challenge assessed?
It is hard to believe that the company could only find 243 individuals for this research - an absurdly small sample from which to capture the complexity of software development work and its daily interactions. But it is interesting to look at what the challenge actually was:
Write code for an API for a restaurant review app. Each group completed the exercise with 10 unit tests to validate the functionality.
Along with building a TODO list app, this is perhaps the most well-known interview challenge in the world, and it is far removed from the day-to-day work of development (although I do believe CRUD makes up the majority of applications today).
You could train your own tiny neural network on your phone and it would probably handle this challenge, no modern LLM required.
How many cases can these 10 unit tests really cover for an API? What is the actual complexity and relevance of the endpoints those tests exercise?
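To put the scale of this exercise in perspective, here is a minimal sketch of what such an API typically boils down to. This is my own illustration, assuming a Flask implementation; the study does not publish the actual specification, routes or tests.

# Hypothetical restaurant review API, roughly the size of the exercise described
# (route names and fields are my own assumptions, not taken from the study)
from flask import Flask, jsonify, request

app = Flask(__name__)
reviews = []  # an in-memory list is enough for an interview-sized exercise

@app.post("/restaurants/<int:restaurant_id>/reviews")
def add_review(restaurant_id):
    data = request.get_json()
    review = {
        "restaurant_id": restaurant_id,
        "rating": data["rating"],
        "comment": data.get("comment", ""),
    }
    reviews.append(review)
    return jsonify(review), 201

@app.get("/restaurants/<int:restaurant_id>/reviews")
def list_reviews(restaurant_id):
    return jsonify([r for r in reviews if r["restaurant_id"] == restaurant_id])

Ten unit tests against something this size mostly assert status codes and a couple of field values; they say very little about quality in a real codebase.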
What is Code Quality?

An interesting detail in the text: it initially claims that with Copilot you can add 13% more code before introducing a code smell. Looking at the graph, we are talking about roughly two lines of code. Statistically significant (according to them), but practically useless.
Where the article clearly goes off track is when defining the quality and error assessment for the test:
In this study, we define Code Errors as any code that reduces the ability of that code to be easily understood. This does not include functional errors that can prevent the code from working, but rather errors that represent bad coding practices.
In other words: generating code that does not work was not a criterion of the study. After all, that is not important for software quality. What they analyzed instead was:
Inconsistent names, obtuse identifiers, excessive line lengths, excessive whitespace, missing documentation, repeated code, loop depth, insufficient separation of functionality, and varying complexity.
The problem with these items is that they mean very little without the context of the technology, the experience of those involved, and familiarity with the language. None of them has a clear, definitive definition that holds across programming languages.
What counts as duplication in Rust is not necessarily the same in Java. What seems impossible to understand for someone who does not know C++ and matrix manipulation is the everyday routine of someone working with Arduino and IoT.
An example:
Without the right context or background, the line below does not look like good code. But for someone who works with OpenCV and YOLO, it is an easy line to understand.
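# Draws the detected class name and confidence score onto the image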
cv2.putText(img, f"{class_name} {confidence:.2f}", vertex1, cv2.FONT_HERSHEY_SIMPLEX, 1, colour, 2,)
Every team needs to create its own coding conventions to give meaning to what "code quality" is. Whether in C, JavaScript, Rust or TypeScript, context and experience are infinitely more important.
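As a contrived illustration (mine, not the study's), both functions below compute the same result; whether the second one is "more concise" or simply harder to read is purely a matter of team convention.

# Version A: explicit and slightly repetitive
def total_with_tax(items):
    subtotal = sum(item["price"] for item in items)
    tax = subtotal * 0.10
    return subtotal + tax

# Version B: "more concise" - or less readable, depending on who reviews it
def total_with_tax_v2(items):
    return sum(item["price"] for item in items) * 1.10

A linter can be configured to flag either style; the thresholds behind metrics like "conciseness" are team decisions, not universal properties of the code.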
Presenting statistics under the assumption that a tool brings significant gains, while showing a roughly 3% improvement in subjective metrics, is questionable at best. It makes me wonder whether this article is really aimed at misleading people who understand very little about technology, but who can sign the check in the expectation of performance gains.
In one of my recent articles, I already presented far more detailed results on using LLMs as coding assistants, and the trend we are seeing of increased complexity, higher code churn and a potential reduction in quality. It is worth reading.
Jadarma offers an even more detailed analysis of the many flaws in this "study", going further and showing how everything, from the selection of the study group to the statistical calculations behind the headline, is wrong.
I expected more.