Improving Website Categorization Based on HTML Tag Statistics for Blocking Unwanted Content
Abstract
Introduction: The continuous growth and ubiquity of the Internet make it increasingly difficult to detect unwanted and malicious information. Existing systems usually classify websites automatically by their textual content, but this approach cannot be applied to websites whose content changes frequently, such as news sites and forums.
Purpose: The goal is to strengthen protection against unwanted or inappropriate information by improving categorization quality with Data Mining techniques for automated parental control systems.
Results: Improved algorithms for website classification have been developed, along with a prototype of a parental control system. The novelty of the proposed approach is that it uses not the textual content but the statistics of HTML tags (the ratio of the number of occurrences of a certain tag on a page to the total number of all tags on that page). The algorithm selects 25 main tags from a set of websites and then calculates the tag statistics for each website. The paper also describes the architecture of the categorization system, which consists of several Perl modules and the RapidMiner software. Experiments with the developed prototype were carried out on previously prepared datasets, comparing categorization quality for text features, structure features, and their combinations. The results showed that the analysis of tag statistics is not sufficient to replace all the other methods. However, it can be a useful complement to existing systems based on textual classification, increasing their accuracy by 6.9% to 10.6%, depending on the number of categories.
Practical relevance: This approach can be used to improve the efficiency of searching for information forbidden by the laws of the Russian Federation (propaganda of extremism, pornography, drugs, anti-social behavior, etc.).
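The tag statistic defined in the Results (occurrences of a tag on a page divided by the total number of tags on that page) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which the abstract describes as Perl modules combined with RapidMiner. The function names `count_tags`, `select_main_tags`, and `tag_statistics` are hypothetical, and choosing the main tags by overall frequency is an assumption, since the abstract does not state the selection criterion.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag encountered while parsing a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> are also routed here by default.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name -> occurrences on one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def select_main_tags(pages, k=25):
    """Pick k main tags across a set of pages.

    Assumption: "main" means most frequent overall; the abstract
    does not say how the 25 tags are chosen.
    """
    total = Counter()
    for html in pages:
        total.update(count_tags(html))
    return [tag for tag, _ in total.most_common(k)]


def tag_statistics(html, main_tags):
    """Ratio of each main tag's count to the total tag count of the page."""
    counts = count_tags(html)
    n = sum(counts.values())
    return [counts[t] / n if n else 0.0 for t in main_tags]
```

Under these assumptions, each page yields a fixed-length numeric vector (one ratio per main tag) that could be passed to a classifier on its own or alongside text features, which is the kind of combination the experiments compare.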
This approach can also be used in parental control systems to deny access to certain types of information according to age category.
Published
2016-12-19
How to Cite
Novozhilov, D., Chechulin, A., & Kotenko, I. (2016). Improving Website Categorization Based on HTML Tag Statistics for Blocking Unwanted Content. Information and Control Systems, (6), 65-73. https://doi.org/10.15217/issn1684-8853.2016.6.65
Section
Hardware and software resources