How reliable are benchmarks and extensive hardware tests?

In reviews, the reliability of benchmarks for smartphones, CPUs, GPUs and other components has been questioned for decades, yet many people rely on that very data before buying a new product. In this article we want to explore the subject in more depth, putting forward our point of view in the usual way: simple and without unnecessary technicalities. We will also deal with tests on other hardware components, not just benchmark software. This is going to be a very long article, one for tea and biscuits.

No benchmark is above suspicion

A benchmark is a piece of software that carries out a series of tests to measure the performance of a product. That product can be a component (CPU, GPU, HDD...), a whole system (notebook, smartphone...) or even other software. For more information, see the English Wikipedia page. In common parlance, we often also call 'benchmarks' tests of a different nature, such as tests on power supplies or monitors. There is no such thing as a trustworthy benchmark in any field. Even economic and financial benchmarks, which have far brighter spotlights trained on them, turn out to be unreliable.

Interference

The most popular benchmarking software (and platforms) belong to private companies, and even those recognised as 'verifiable' are chaired by, or have among their members, brands with direct interests. There are many ways in which benchmarks can be influenced, yet none of them will return incorrect results. And a correct result is not necessarily a true one. Put simply: according to such-and-such's benchmark, a bottle of wine fills six glasses. However, the type of glass and its capacity are decided by such-and-such and can differ greatly from our own glasses. The statement is therefore correct, but not true.
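A minimal arithmetic sketch of the point (the volumes are invented purely for illustration): both divisions are correct, the only thing that changes is the glass chosen by whoever defines the test.

```python
# Illustrative numbers only: the point is who gets to choose the unit of measurement.
BOTTLE_ML = 750

vendor_glass_ml = 125   # the glass picked by whoever wrote the benchmark
our_glass_ml = 200      # the glass we actually use at home

print(BOTTLE_ML / vendor_glass_ml)  # 6.0  -> "a bottle fills six glasses": correct
print(BOTTLE_ML / our_glass_ml)     # 3.75 -> what we experience: also correct, very different
```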

The bottom... of truth

Over the years, several benchmarks have come under fire for a lack of objectivity. Recent cases include the class action, almost 15 years long, against Intel (and HP), accused of falsifying WebMark and SYSmark results. Intel did not admit guilt, but preferred to settle the matter with money, after having spent a fortune on lawyers. It might look like a non-story, with no admission of guilt and with AMD not even trying to claim compensation. But anyone who has owned a Pentium 4 and an Athlon knows that the Athlons ran better. Even the Pentium 3s ran better than the 4s, for that matter.

Repetita SYSmark

For SYSmark it would seem to be an incurable vice, at least according to AMD. SYSmark is developed by a non-profit consortium consisting mainly of Intel's partners, as well as Intel itself. Among the members we also find a major publication, Cnet, which in the collective imagination should be more neutral than any benchmark. In short, a nice little environment without the slightest conflict of interest.

Differences between synthetic, real-world and hybrid benchmarks

Synthetic benchmarks are the most popular and therefore the best known. Names like 3DMark and PCMark sound familiar to anyone who has searched for information online at least once. Yet synthetic benchmarks are also unanimously considered the most useless. The reason is simple: they measure performance in isolation, ignoring every other factor and without reflecting the actual user experience. This means that a manufacturer can easily tune its hardware to get the most out of a synthetic benchmark (and vice versa *blink-blink*).

Real-world benchmarks

This class of benchmark focuses on real workloads: creating 2D and 3D models, decompressing a file, converting a video, and so on. The problem is that they cannot measure the performance of individual components; what gets measured is always the performance of the entire system. To get around this little snag, the larger magazines tend to use the same configuration and swap only the component under review. So we get the same RAM, same motherboard, same SSD and so on, but different processors when comparing CPUs of the same brand; of course, at least the motherboard changes when the processor platform changes. These tests should be repeated from time to time, all of them, because drivers change, software gets updated, and there is a whole host of variables to consider. But no newspaper/magazine/website assembles a system dozens of times to re-run all the tests; they simply reuse data from past tests.
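To make the idea concrete, here is a toy 'real-world' test (the workload, compression level and repeat count are arbitrary choices for illustration, not any standard): time the same job several times, keep the median, and only attribute a difference to a component if everything else in the configuration stayed identical.

```python
import statistics
import time
import zlib

# ~6 MB of repetitive data standing in for a real file to compress
payload = b"some fairly compressible text " * 200_000

def run_once() -> float:
    """Time a single compression pass over the same payload."""
    start = time.perf_counter()
    zlib.compress(payload, level=9)
    return time.perf_counter() - start

times = [run_once() for _ in range(7)]
print(f"median {statistics.median(times):.3f} s "
      f"(spread {min(times):.3f}-{max(times):.3f} s)")
# Swap one component (CPU, RAM...), keep everything else identical, re-run,
# and only then compare the two medians.
```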

Hybrid Benchmarks

Some synthetic benchmarks, including those already mentioned, are now considered hybrids. They perform the usual synthetic tests, then add a set of scenarios related to real-world usage. Here again, we have a type of benchmark that is useless for understanding the performance of individual components. Simplifying... The uninformed usually ask the more experienced: 'What PC do you recommend?', and get the counter-question: 'What do you need it for?'. When the answer is 'a bit of everything', the hybrid benchmark is useful. In all other cases, the real-world benchmark is what you need.

Smartphone benchmarks

If it is possible to be more useless than a synthetic benchmark, smartphone benchmarks manage it. They are both the apotheosis of the easily faked (even by ordinary users, via an app plus root access) and the emblem of the gulf between a high score and the actual user experience. Even more than a PC, a smartphone has to be used for days before it can be judged, and no theoretical performance figure matters.

So why does everyone publish benchmark results?

Let's start by saying that there is a difference between using benchmarks and publishing their results. We at RecensioniVere use benchmarks but do not publish the results. We use them mainly to see on the fly whether the system is stable, whether any parts are about to fail, whether the unit we fitted is faulty, and so on. The numbers are also taken into account, as is natural, but we refuse to fuel the race to see who is marginally faster than whom, because the end user will never notice the difference.
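As a rough illustration of that kind of use (a sketch only: a real stability check runs far longer and stresses many more parts of the system), the point is not the score but whether the machine keeps producing the same, correct result under load.

```python
import hashlib

def workload() -> str:
    """A deterministic CPU load: chain 200,000 SHA-256 hashes."""
    data = b"stability check"
    for _ in range(200_000):
        data = hashlib.sha256(data).digest()
    return data.hex()

reference = workload()
for i in range(20):
    if workload() != reference:
        print(f"run {i}: result diverged, the system is unstable or a part is faulty")
        break
else:
    print("all runs identical: nothing suspicious found by this (limited) check")
```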

What about the others?

Compared to only five years ago, many publications have abandoned benchmarks or limit themselves to publishing the score of the tested component alone, without comparisons. Behaviour that we support, needless to say, as this is how it has always been for us. Those who continue to publish them do so for three main reasons. The first is that the average user and fanboy demands the numbers, even without understanding them. Benchmarks, especially hybrid-synthetic ones, must be interpreted; it is not enough to read them. Many times, the published charts show amazing results that do not match the reviewer's own commentary. The second reason is that the charts get picked up by other sites, blogs and forums. For ranking on Google it is important to have plenty of links from other sites, and publishing mega charts is one of the easiest ways to get mentioned. The third reason is sponsors, and we will look at this in more detail in the next section.

How do you influence the outcome of a test?

Last year there was a rumour, confirmed by several media outlets, that Intel (again, Intel) had sent an email to a large group of reviewers, inviting them to get in touch before writing about the new AMD Ryzen processors. The news went around the world, nobody was formally warned against spreading it, and a few publications candidly admitted that it is normal, that everybody does it. Our 'How does it work?' page, where we talk about this world, has been up for a while, and perhaps a minimum of transparency from everyone is needed. That said, keep in mind that no PR person, at least by email, will ever ask outright for a positive review or for the competition to be badmouthed. The risk of legal warnings and hefty fines would be too high!

Guidelines for Reviewers

The giants, but also some tiny giants, are in the habit of proposing guidelines to media and influencers. And, although it is easy to see how not following them can lead to repercussions, we repeat that the guidelines are proposed, not imposed. They are presented in different ways and formats, sometimes personally and sometimes addressed to selected/trusted publications. Here we are no longer talking only about CPUs or GPUs or smartphones: this applies to any product.

The Friendly Advice

This is an email, longer than usual, in which the friendly PR person advises you on how to get the most out of their new product, inviting you to pass these suggestions on to your users. It is common, rarely truly intrusive, and these emails happen to everyone, us included. They are not real guidelines, just a bit of PR patter about the product's main features. In short, perfectly legitimate stuff.

The sincere recommendation

The tone is similar to that of the friendly advice, but you sense some detachment. The friend is less of a friend; he gets down to business and recommends what he would like to read. Our product is unrivalled in this... Remember to publish this kind of test... We would prefer you to use this software... If you have a *brand_of_board*, use it... Push the overclock right away... If you use these settings the photos come out better, I'm telling you for your own sake since you have to publish them... We know you have this problem but it will be solved soon, no need to write about it... It would be interesting to compare these three competing products...

Organised ones

In very rare cases this turns into an actual to-do list, attached as a Word or PDF document. For some products the list comes with other information material, where certain points are explained step by step. In short: you are free to test, but in the way you are told to. This pressure is exerted on sponsored influencers through very detailed contracts that include penalties if they break an embargo or talk about this confidential material. They, too, are free to follow the list or not. To publish the review or not. In the end, it is always a personal choice...

What can influence a test?

Lots of things! Reviewing a CPU without disabling MultiCore Enhancement and similar features means the test is already doped. Still on CPUs, remember that they count for very little in game benchmarks once the graphics settings are turned up to maximum. A video of how a smartphone behaves, shot just after a reset, is not representative of its actual fluidity. Photographs taken with smartphones - when published - should always be 'point and shoot', without adjusting the settings. The stability of any hardware component cannot be judged simply by running benchmarks, and certainly not on the basis of a few hours of use. The same goes for notebooks. The power supply used, which is rarely discussed, influences performance. In certain situations, the type of heatsink or the way the thermal paste is applied also weighs on that #zero-point benchmark difference.
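A tiny, hedged illustration of how fragile those decimal points are: repeat the exact same job back to back and print every run. On many machines the numbers drift as caches warm up and the CPU boosts or throttles, so whichever single run ends up in a chart tells its own story.

```python
import time

def job() -> float:
    """A fixed amount of arithmetic, timed."""
    start = time.perf_counter()
    total = 0
    for i in range(2_000_000):
        total += i * i
    return time.perf_counter() - start

for n in range(10):
    print(f"run {n:2d}: {job() * 1000:7.1f} ms")  # same code, rarely the same number
```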

Tests that are too technical are not reliable

The previous sub-section could go on forever. The basic fact is that the more a review descends into the technical, the less reliable it becomes. The more it gets lost in a smokescreen of numbers, the more AdBlock could mistake the entire review for a giant advertisement. Mind you, some in-depth tests are a pleasure to read. Take power supplies: fantastic tests! Lots of little numbers, symbols, acronyms... Then you buy the unit, it lasts six months and spends another three in a service centre in Germany. But it was a good read, wasn't it? Tests are useful to the reviewer, however. Those results should be digested by the reviewer and then translated. Translated well. Today we see endless tables, a list of technical specifications, and three actual lines of review. In videos it is even worse.

SSDs, the toughest to test

Speaking of power supplies, it would be enough to talk about stability and then help the reader read the label, after verifying it. Yet they seem to be the most difficult products to review. By contrast, SSD reviews appear everywhere, with the usual two or three benchmarks. SSDs and HDDs hold our data, so they should be the best-reviewed components of all. How reliable are they? TechReport tried to answer this question in 2015, and goosebumps while reading are guaranteed.

Finally, it is best not to forget that in standard tests a 500 euro SSD will post staggering numbers. In everyday use, however, there is little difference compared to an ordinary 100 euro SSD.
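A rough sketch of why the headline figure and everyday use diverge (this is not a proper disk benchmark: the operating system cache distorts both numbers, and real tools bypass it): the same amount of data written as one big sequential file and then as many small files behaves very differently.

```python
import os
import tempfile
import time

TOTAL_MB = 64
CHUNK = b"\0" * (1024 * 1024)   # 1 MB of zeros
SMALL = CHUNK[: 256 * 1024]     # 256 KB slice

with tempfile.TemporaryDirectory() as tmp:
    # one big sequential file
    start = time.perf_counter()
    with open(os.path.join(tmp, "big.bin"), "wb") as f:
        for _ in range(TOTAL_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())
    print(f"1 file of {TOTAL_MB} MB:    {time.perf_counter() - start:.2f} s")

    # the same volume of data as many small files
    start = time.perf_counter()
    for i in range(TOTAL_MB * 4):
        with open(os.path.join(tmp, f"small_{i}.bin"), "wb") as f:
            f.write(SMALL)
            f.flush()
            os.fsync(f.fileno())
    print(f"{TOTAL_MB * 4} files of 256 KB: {time.perf_counter() - start:.2f} s")
```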

Temperatures

Playing with temperatures is another easy way to sway readers. For the last ten years or so, infrared thermometers have been all the rage. Usually they are the ten euro ones; every now and then you see a 50 euro model. The point is that they could cost 200 and would be just as useless if not calibrated for the type of surface being measured. Surfaces inside a PC are the worst to measure with IR; probes must be used. And even probes should be of good quality and calibrated, not used straight out of the box. Showing off useless equipment serves only one purpose: to deceive readers and inexperienced PR people.
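For the curious, here is what a basic calibration even looks like, as a sketch: a two-point linear correction for a cheap probe, assuming two known reference temperatures (roughly 0 °C in ice water and 100 °C in boiling water at sea level). It is a simplification; a serious lab uses proper references.

```python
# Readings the uncalibrated probe gives at the two reference temperatures (example values)
raw_at_0 = 1.8
raw_at_100 = 97.1

def corrected(raw: float) -> float:
    """Map the probe's raw scale onto the 0-100 °C reference scale with a linear correction."""
    return (raw - raw_at_0) * 100.0 / (raw_at_100 - raw_at_0)

print(round(corrected(46.3), 1))  # ~46.7 °C: the value to report instead of the raw 46.3
```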

Monitors

Monitors are another big headache for those who have to review them. Actually, once again, it is not at all difficult for the consumer and gaming market; it is much more complicated for those who work with graphics. But some people want the complex tests, the usual little numbers and graphs, so something very simple turns into a pile of poorly executed tests and bad results. The only site that did thorough testing properly was Xbitlabs. Which, in fact, closed. No sponsors for those who do good work!

The rest of it

Cameras, televisions, routers, keyboards... Almost every category has its own slew of botched technical tests. What can we say? When a few thousand euros are going to be invested in a professional device, it may be worth commissioning extensive testing, and even seeking out someone who does it properly. For other products, we are back to the marketing and #zero-point discourse. Conditions inside homes are not those of a laboratory, a garage or an office. They are not real conditions in real scenarios. And no editor/influencer can really test every variable. It is all smoke and mirrors.

Products selected for reviewers

Back to influences, this time indirect ones. Ever since the first reviews appeared in newspapers and magazines, there has been a habit of submitting only selected products to reviewers' judgement. Hand-picked units. It happens with every product that requires particularly painstaking quality control, from cars to PC hardware to the medical equipment offered by reps to clinics. But when it comes to dies, there is a further selection phase, one that affects all consumers.

Binning

Once the circuits are printed on the silicon slices (the wafers), the slices are cut into those small rectangles we all know. These rectangles can become CPU dies, GPU dies, controller dies, memory chips and so on. Before they can be used, however, they must of course be tested: not all dies from the same slice are identical. Whoever ordered the rectangles sets the criteria by which they are to be classified. Now, since this discussion is mainly about overclocking, you might think the only criteria would be the tolerated frequencies and voltages: not so. Among the criteria there is an extensive list of operations that the rectangle must be able to perform.

Fixed and rated

Those that fail certain operations are, where possible, repaired: there may be 'spare components' on the rectangles that can be unlocked in the event of faults. At this point, once the tests are finished, comes the classification. Units with too many errors, and those below the minimum standard, end up in the foundry. The others are divided up by performance. In theory the whole slice is made up of identical dies that should deliver identical performance, but the defect rate is high and only a few rectangles come out perfect.
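To picture the outcome, here is a toy model of the classification (the criteria, thresholds and numbers are all invented): each rectangle gets a defect count and a maximum stable frequency from testing and is routed to a bin accordingly.

```python
import random

def classify(defects: int, max_mhz: int) -> str:
    """Route a tested die to a bin. Thresholds are invented for illustration."""
    if defects > 4:
        return "scrap"            # too many errors: back to the foundry
    if defects > 0:
        return "cut-down model"   # repaired / partially disabled, sold as a lower tier
    return "flagship" if max_mhz >= 4800 else "mid-range"

random.seed(0)
dies = [(random.randint(0, 6), random.randint(4200, 5200)) for _ in range(1000)]

bins: dict[str, int] = {}
for defects, max_mhz in dies:
    label = classify(defects, max_mhz)
    bins[label] = bins.get(label, 0) + 1

print(bins)  # only a fraction of a nominally identical wafer ends up in the top bin
```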

Marketing

When you read that a new graphics card has the same GPU as a higher-end model but with some features removed, it is nothing more than an imperfect GPU saved from the foundry. And no, there is no way to make it perform like its luckier twin. The same goes for CPUs: the architecture may be the same, but if one model sells for 100 and another for 600, it means the 100 euro part came out rather badly and it is not worth pushing it. Another reason to think several times before diving into overclocking. ;)

Silicon lottery

Yet, more or less directly, it is the manufacturers themselves who push overclocking. They do so by demanding the publication of the appropriate tests, explaining to reviewers that 'it's the high-end model, just clocked lower; push it up and maybe it becomes the same... *blink-blink*'. They do it blatantly with the high-end products, the perfect rectangles, selling them as already overclocked, unlocked, and so on. The consumer is thus steered towards buying something that *might* turn out better than the price paid. Might, indeed, because it is by no means guaranteed.

Perfectly imperfect

The silicon lottery is so called because every rectangle is different from the others. No manufacturer makes a further selection to guarantee high 'overclockability'. In theory the best parts should also be the most stable under heavy overclocks, but nobody, without testing them individually for exactly this purpose, can say by how much. Normally the advice should be to stick to the specifications under which the product is sold, at least for the first two years of warranty plus any extensions. Instead, the general advice is to gamble, and then, in fact, spend more. Until a few years ago people even went as far as reactivating cores that manufacturers had disabled because they had failed the tests. Today that is impossible, because they are physically disabled (the same goes for compute units). Yes, there was a time when forum gurus and influencers would say: buy this model so you can activate the failed cores (and fry them).

Partners of manufacturers are not immune

While repairs are generally made by the manufacturers, binning is done by the partners, that is, by those who will mount the blessed rectangles on their own products. Thousands of rectangles are spread out on trays and shipped to the factories. There they are placed on the printed circuit boards and tested, a hundred or so at a time, and then catalogued. Video card makers also cross their fingers when they receive the trays from Nvidia or AMD, because the binning tells them the default clocks. The overclocking headroom, on the other hand, they will only know once the actual video card is assembled. That is, if they decide to test them one by one, which they do not. The famous factory overclock is the reasonable one, the one that - according to their data - the little rectangles in that range should be able to handle without melting anything within the 2-3 year warranty period. If the board melts after the three years: so much the better!

Partners of manufacturers are not immune /2

Beware: dies discarded by one brand can be used by another! The trays of rejects go back to the producer who, after issuing a partial refund to the partner, sells them on to someone with lower requirements. Some manufacturers, to protect their prestige, at this point remove their own branding from the parts and the new buyer's brand appears instead. Others leave it on but charge a little more. That is why some cheap Chinese gadgets carry components of a certain prestige yet perform nowhere near as well as higher-end products!

Cherry-picking

And here we finally come to the topic announced at the start of this section! The term indicates a selective choice: in our case, the choice the manufacturer makes when sending new products to reviewers. Broadly speaking, products are divided as follows: engineering samples, press samples (also called review samples), limited samples (golden samples), and the ones we find in the shops.

Engineering samples

We could call them almost-prototypes. They are the result of pre-production and the recipients are usually the brand's partner companies who use the product as a reference model to start working on it. It is very rare that these products end up in the hands of the press or are exhibited at trade fairs. Sometimes, and in very limited quantities, they are sent to major influencers to get an initial opinion, especially in terms of design. When this happens, it is usually PC cases, headsets or mice (but also perfumes, clothing, etc. if we want to get out of the technology sphere). When it comes to processors and video cards, it means that real production is already underway. Something can be found on eBay by searching for 'brand_name + engineering sample', bearing in mind that they are less stable and out of any kind of warranty.

Press samples

These are usually products identical to those in the shops which, at most, have different packaging. Perhaps they lack accessories or arrive packed in a spartan way. For some brands, sending samples is a kind of sponsorship, which may stop if reviews are negative or the guidelines are not followed. Many 'used, like new' listings on eBay and similar platforms belong to this category. These products are not subject to any kind of selection.

Golden samples

Golden samples are perfect products. They are checked several times and may also come in different, more lavish packaging. They are gifted to influencers whose opinion carries real weight in a given field, (almost) regardless of their number of followers on social media. For example: a youtuber with 2 million subscribers who covers technology in general might receive a regular press sample of a new pair of headphones. Another youtuber, with 'only' 500,000 subscribers but an audiophile audience, will instead receive the golden sample (and a framed copy of the guidelines ;) ).

Golden samples /2

Last year, the case of some SuperNova B3 power supplies branded EVGA and manufactured by Super Flower caused a stir. It emerged, in the words of the protagonists themselves, that EVGA had sent jonnyguru.com, the global reference point for power supply reviews, golden samples free of any problems. Like everyone does. The problem (theirs) is that more and more websites, just as RecensioniVere does, no longer play along with the games of some lousy PR and buy the products they review themselves. To EVGA's (and Super Flower's) misfortune, it was an industry information giant that bought the power supplies: Tom's Hardware. Not only did the units fail TH's tests, they showed serious safety flaws, complete with sparks! And when TH asked EVGA for samples to repeat the tests, the answer was, needless to say: NO. Nor is this the first time EVGA products have played with fire: they had problems with the GTX 1070 and 1080 in 2016.

Welcome to the real world!

After all that, what is the point of in-depth tests any more? Tests that, more than once, have proved incapable of replicating real conditions of use, becoming a kind of alternative synthetic benchmark. More useful to social-media parrots, who repeat results by heart and set themselves up as experts, than to us consumers. Some might object, citing the case above, since we do not test the protections either. But we use the products; we do not keep them running for a couple of hours and then resell them. If a power supply is problematic, sooner or later it will give trouble to those of us who fitted it, and power supplies stay on the market for many years... This is why we always update reviews when problems emerge: to warn everyone about a defect that only a real user can know about. Not to mention that sometimes we also suggest how to fix a faulty product.

Between guidelines and limited samples

If the product that gets sent must be tested under optimal conditions to earn a glowing review, and if that product may even differ from the one on the market, then what exactly are we reading? What benchmarks are we talking about? What credibility do the test results have? Our 100 euro office chairs broke, as will have happened to many readers; YouTube is full of people enthusiastic about theirs. Do we live in a parallel world? Years ago we reviewed a Gigabyte product with noticeable design problems, which we could only point out because it was a purchase and not some kind of PR bribe.

Between real manufacturers, stickers and cuts

Outsourcing is a problem. We see one brand, but the producer is often another. We search for information on the web, and sometimes it is available, but the external partner may in turn commission someone else. Then the trail goes cold. Rumours often circulate that a product with a certain code, or a different 'made in', is of higher quality. Just as often, those rumours are absolutely true. Chips, batteries, capacitors, printed circuit boards: every single component of the devices we own may have been sub-subcontracted. And the more 'subs', the lower the costs and the worse the result in terms of efficiency, stability and durability. It is about time a law obliged brands to declare this. The limited samples are sometimes assembled not in factories but in workshops, and then in the shop we buy the sub-sub-sub-sub...

What can we do? We are old...

The most important thing, when crunching benchmark numbers, would be to understand what you are reviewing: the motherboard? The CPU? A game? And try to focus on that.

For us RV is a hobby: we don't make money and we don't lose money. Ten years ago there were very few industry websites in Italy and a couple of influencers. Today there are many more, but quality has deteriorated in the same proportion. Industry editors sometimes talk among themselves, we exchange emails, and some of the causes of that deterioration are described in this article. Building a relationship of equals with the reader, a conversation between friends, between peers, has raised the average age of the site, and somehow we have escaped the anxieties of chasing a youthful target. We will remain oldies among oldies, and that is fine with us. ;)

P.S. The power supply brand Seasonic, in order to be above suspicion, has started shipping its samples through Amazon. That seems to us a good starting point; we look forward to seeing it do the same for non-sponsored reviewers as well.