The demand for tests appeared almost simultaneously with the development of the first antivirus programs – in the mid-to-late 1990s. Demand created supply: test labs at computer magazines started to measure the effectiveness of security solutions with the help of self-made methodologies, and later an industry of specialized companies emerged with a more comprehensive approach to testing methods.
The first primitive tests scanning huge collections of malicious and supposedly malicious files taken from everywhere were rightfully criticized first and foremost by the vendors. Such tests were characterized by inconsistent and unreliable results, and few people trusted them.
More than 20 years have passed since then. In that time, protection solutions have continuously evolved and become more and more efficient thanks to new technologies. However, the threats have developed too. In their turn, the laboratories have been steadily improving their testing methods, devising the most reliable and accurate procedures of measuring the work of security solutions in different environments. This process is neither cheap nor easy, and that’s why we now have a situation where the quality of testing depends on the financial standing of a laboratory and its accumulated expertise.
— Eugene Kaspersky (@e_kaspersky) 3 февраля 2017 г.
When it comes to the costs of testing, the main question is: cui prodest? (Lat. who benefits?) And this is one of those rare cases where high-quality work benefits everyone: both vendors and customers. The fact is that independent testing is the only way to evaluate the effectiveness of a security solution and to compare it with its competitors. Other methods simply do not exist. For a potential customer, testing plays an important role in understanding what kind of product best meets their needs. For vendors it provides an opportunity to keep up to date with the industry; otherwise, they may not notice when their product falls behind the competition, or development is heading in the wrong direction. This practice is also used in other industries, for example, EURO NCAP independent security tests.
However, the world is changing, and new solutions are emerging for old problems. The cybersecurity industry welcomes products based on modern approaches, such as machine learning. The potential opportunities are breathtaking, although it may take years before we see them implemented effectively. Of course, in their marketing materials the vendors of so-called next-gen anti-malware software are keen to talk about the incredible capabilities of their products, but we shouldn’t believe everything they say – it’s better to continue comparing solutions based on test results.
Basic testing methodologies
The question “How do you test?” was asked by the testers in the early years of the industry. They still ask themselves this question to this day. Testing methodologies have evolved in line with the evolution of cyber threats and security solutions. It all started with the simplest method – an on-demand scan.
On-demand scan (ODS). The test lab collects all types of malicious programs (mainly files already infected with malware – nowadays it’s mostly with Trojans), adds them to a folder on the hard disk, and then launches the security product in question to test it against the entire collection. The more the product catches, the better it is. Sometimes in the course of testing, files are copied from one folder to another, which is slightly closer to a real work scenario.
At one time this method used to be enough, but now the lion’s share of advanced security technologies don’t work with this type of testing, meaning it’s not possible to assess how effectively solutions actually counter the latest threats. Nevertheless, ODS is still used, often in combination with more advanced methods.
On-execute test. This is the next stage in the development of test methods. A collection of samples is copied and launched on a machine where the security software is running, and the reaction of the security solution is recorded. This was once viewed as a very advanced technique, but its shortcomings were soon revealed in practice. A modern cyber-attack is carried out in several stages. A malicious file is just one part of that attack and isn’t intended to function on its own. For instance, the sample may be waiting for command-line parameters, require a specific environment (e.g., a specific browser), or it may be a module in the form of a DLL that connects to the main Trojan, rather than running on its own.
Real-world test (RW). This is the most complicated test method, but also the closest to reality, imitating the full cycle of infecting a system. The testers open a malicious file delivered via email on a clean system with the security solution installed, or via a browser following a real malicious link to check whether the whole infection chain works, or if the solution being tested was able to stop the process at some stage.
These types of tests reveal various problems that security software may experience when working in the real world with real threats.
However, this method of testing requires serious preparatory work. Firstly, full-scale testing of a hundred or more samples requires a large number of machines, or a lot of time – something that few laboratories can afford. Secondly, many of today’s Trojans can tell if they are being launched in a virtual environment and won’t work, hindering the efforts of any researchers trying to perform an analysis. Therefore, in order to obtain the most reliable results, the test laboratory must use physical computers, rebooting the system after running each malware sample. Or to use virtual machines, but be more careful while selecting samples for testing.
Another difficult task is generating extensive databases of malicious links. Many of them are used just once or work with restrictions (for example, only in certain regions). And the quality of the RW test is highly dependent on how good the laboratory is at finding such links and whether it handles them correctly. Obviously, there is no point if one product works when a link is opened, while the others cannot be tested because the link has subsequently “died”. To achieve accurate results, there needs to be a large number of these links, but providing them can be difficult.
Behavior or proactive test. According to this technique, a security solution is tested using samples that are unknown to it. To do this, the experts install the test product and do not update it for several months. A collection of malicious programs that appeared after the last update is then used to perform the ODS and on-execute tests. Some test companies pack or obfuscate known threats to test the security solution’s ability to identify malicious behavior. It is quite difficult to conduct this sort of test properly. For example, some laboratories disable updates by blocking Internet access on the test machine, thus depriving the security solution access to the cloud. This may cause some technologies to fail, meaning the products are not being tested on an even playing field and undermining the relevance of the test.
Removal or remediation (test for the complete removal of malware). This checks the ability of a security solution to treat the system, i.e., clean autorun keys, remove a task scheduler and other traces of malware activity. This is an important test, because poor treatment may cause problems when booting or using the OS or, what is worse, the restoration of the malware in the system. During testing, a “clean” system is infected with a malicious program from the collection of malware samples. The computer is then rebooted, and a security solution with the latest updates is installed. The majority of testers perform the test to check the quality of a solution’s treatment; it may be in the form of a separate test or as part of an RW test.
Performance test. This test evaluates how effectively a security solution uses the system resources. To do this, the speed of various operations are measured with the security solution installed and running. These operations include system boot, file copying, archiving and decompression, and the launching of applications. Test packets simulating realistic user work scenarios in the system are also used.
False positive test. This test is necessary to determine the reliability of the final evaluation. Obviously, an anti-malware program that detects all programs as malicious, will receive scores of 100% in the protection category, but will be useless for the user. Therefore, its reaction to legitimate applications has to be checked. To do this, a separate collection of popular software installation files is created and tested using different scenarios.
Feedback. This is not a test methodology, but rather the most important stage of any test, without which the results cannot be verified. After performing all the tests, the laboratory sends the preliminary results achieved by the product to the respective vendor, so that they can check and reproduce the findings and identify any errors. This is very important because a test laboratory simply doesn’t have the resources to check even every hundredth case, and errors are always possible. And those errors are not necessarily caused by the methodology. For example, during an RW test an application can successfully penetrate a machine but not perform any malicious actions because it is intended for a different region, or it is not initially malicious, being used instead for advertising purposes. However, the program was installed and the security solution didn’t block it, resulting in a reduced score.
At the same time, a security solution is designed to block malicious actions. In this case, however, no malicious actions were performed, and the anti-malware program worked as planned by the vendor. It is only possible to handle such cases by analyzing the code of the sample and its behavior; in most cases, the test company doesn’t have the resources for that and the vendor’s help is required.
Another layer of methodologies are those used for in-depth testing of specific types of threats or specific security technologies. Many customers need to know which solution is most effective against encryptors, for example, or which product offers online banking systems the best protection. An overall evaluation of a security solution is not very informative: it only shows that one product is “no worse than the others”. This is not enough, so a number of test laboratories carry out specialized tests.
Exploits. Counteracting exploits is more difficult than detecting malware samples and not all security solutions can do it successfully. To assess exploit prevention technologies, laboratories use RW tests: the testers collect links to exploit packs, follow them on a clean machine, record the traffic and reproduce it for all the anti-malware solutions in a test. In order to make the experiment as pure as possible, some companies, in addition to real exploit packs, create their own exploits using frameworks like Metasploit. This makes it possible to test a security solution’s reaction to the exploitation of software vulnerabilities using unknown code.
Financial threats. Internet banking and bank client systems are very popular vectors of attack with cybercriminals because they offer direct financial benefits. A number of specific technologies are used, for example, substitution of web page content or remote system management, and it is necessary to check how well a security solution counteracts them. Also, many vendors offer specialized security technologies to protect against financial threats (for example, our SafeMoney); their effectiveness is also checked by these tests.
Special platforms. The vast majority of tests are carried out on the most common platform – an up-to-date version of Microsoft Windows for desktops. However, users are sometimes interested in the effectiveness of a security solution on other platforms: Android, Linux, Mac OS, Windows servers, a mobile operating system, or even earlier Windows versions (e.g., Windows XP still works on most ATMs, and banks are unlikely to upgrade any time soon). These tests are usually carried out according to the simplest methods, because demand for them is minimal.
Types of tests
Besides methodology, tests differ by type. A security solution can be tested independently of competitors (certification) or together with competitors (comparative test). Certification only determines whether a solution combats existing threats effectively. Comparative tests take up more of a laboratory’s resources, but offer the vendor and potential customer more information.
Tests also differ in terms of their frequency and the methods used to calculate the results. Most test companies carry out regular tests that are performed every one to six months. The results of each test are calculated independently: the same solution can be scored highly in one test, while in the next test it may get a low score, or vice versa. In addition to the features of the product itself, this depends on the collections used or changes to the methodology.
In the case of continuous testing, security solutions are also tested at regular intervals, but scores are given cumulatively, for example, every six months following monthly testing, or both cumulatively and for each test. Of all the numerous tests conducted, the most indicative for the consumer and the most important for the vendor are continuous tests. Only the results of “a long distance race” make it possible to calculate the results obtained from different product versions with a variety of collections and to achieve the most relevant product evaluation.
The main thing about continuous tests is that only their results allow to both evaluate the effectiveness of previous and current versions of the product and understand how it is likely to behave in the foreseeable future. High scores in annual or longer testing suggest that constant work is being done on the product. It is not only receiving database updates; the developers are closely watching the threat landscape and responding to changes.
There are numerous companies on the anti-malware testing market. This is undoubtedly beneficial for the industry, as each of them carries out tests on their own collections, and assessments from different laboratories make it possible to evaluate the overall quality of a security solution. However, it should be noted that not all test companies use sufficiently developed methodologies. Below are the best known test laboratories.
AV Comparatives. This Austrian company is one of the oldest on the anti-malware testing market. It specializes in security solutions for B2B and performs a range of tests including RW tests on its own closed collection. The tests are held every month over a period of 10 months, and the best vendors are awarded the Product of the Year or the Top Rated title. Once a year its analysts test solutions for Android and Mac OS, but only using the ODS methodology.
AV Test. A German company founded more than 10 years ago and currently the biggest player on the market. It conducts monthly comparative RW tests, the results of which are provided every two months; it also uses the ODS + ODS + OES methodology (i.e., the sample is scanned, run and scanned again). It conducts performance tests. The company also tests Android and twice a year tests Mac OS using the ODS method.
MRG. Based in the UK, the company has been testing since 2009. It specializes in in-depth technology tests and carries out quarterly comparative RW tests (360 Assessment). MRG also tests financial threats (online banking test) and occasionally performs exploit prevention tests. It also carries out various on-demand tests.
SELabs was founded by former DTL employee Simon Edwards. It has replaced DTL on the market and uses the same set of tests. It conducts quarterly RW tests, including exploit-prevention tests, using its own attacks devised on the basis of frameworks.
Virus Bulletin. It conducts very simple test certification based on ODS methods using the static Wildlist collection, available for download by vendors. It also carries out behavioral tests of anti-malware solutions whose databases have not been updated for several months.
ISCA. This US company is a division of Verizon. It only performs certification tests and additionally tests Anti-APT class solutions.
NSS Labs. Yet another US company focused on the corporate segment. Its results are not published, but instead provided on a paid subscription. The company’s arsenal includes RW tests, exploit prevention tests and protection against APTs.
Magazines and online publications such as PC Magazine, ComputerBild, Tom’s Hardware Guide, etc., also carry out their own antivirus tests. However, their approach is not very transparent: they don’t make their malware sample collections public and don’t provide feedback to the vendor.
In addition to the aforementioned market players, there are plenty of local players conducting irregular or specific tests at the request of vendors. Care should be taken when evaluating their results because their methodology is often not transparent, and the selection of comparative testing participants is far from complete. It should be noted that a high-quality methodology is not invented overnight; it is a complex and expensive process that takes years of work, requiring significant resources and expertise. Only such tests provide an objective picture of the security product market.
How to win tests
To come out top in testing, it’s not enough to simply introduce new technologies; coming first in a test is always the result of arduously and continuously correcting errors. The more high-quality tests a product participates in, the more accurate the information developers get about where to look for challenges and shortcomings.
Winning in a simple test carried out by a company that doesn’t have a well-developed methodology can be useful for marketing: you can place a label on the product box and issue a press release, but experts know that it’s necessary to look at the results of the flagship tests. The most important thing about them is a clear, fully developed methodology that corresponds to the modern threat landscape.
Even first place in a test where product performance and false positives are not tested, doesn’t reveal anything about a product’s ability to cope with modern threats without having a negative impact on the user. It’s easy to achieve 100% protection, but much harder to do so without interfering with the user’s work. And if a vendor doesn’t participate in the more advanced tests, it’s a clear sign that they have some serious problems with product performance.
Read more here:: Securelist