IEEESTEC 17th (2024), pp. 375–378
AUTHOR(S): Veljko Pavlović
DOI: 10.46793/IEEESTEC17.375P
ABSTRACT:
This paper develops an algorithm for comparing the structures of HTML pages using the graph2vec model. Graph2vec is a graph embedding technique that maps HTML structures into a vector space. The results were evaluated with a visual method that renders images of the web pages and computes the similarity between them. Two metrics were used to compare pages: Euclidean distance and cosine similarity, the latter rescaled to the interval [0,1] so that results can be expressed as percentages. The results showed that the visual method is strongly affected by stylistic elements such as colors, backgrounds, and other visual attributes of the page. The graph embedding method, by contrast, uses the structure of the HTML page and focuses exclusively on the organization and connectivity of its elements, which proved to be the better solution.
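The two comparison metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not state how cosine similarity is rescaled to [0,1]; the linear mapping (1 + cos) / 2 used here is an assumption, as are the function names and the toy vectors, which merely stand in for graph2vec embeddings of two pages.

```python
import math


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_percent(u, v):
    """Cosine similarity rescaled to [0, 1] and expressed as a percentage.

    Assumption: the rescaling is (1 + cos) / 2; the paper's abstract does
    not specify the mapping it uses.
    """
    return 100.0 * (1.0 + cosine_similarity(u, v)) / 2.0


if __name__ == "__main__":
    # Hypothetical embeddings standing in for two HTML pages.
    page_a = [0.2, 0.7, 0.1]
    page_b = [0.25, 0.6, 0.2]
    print(euclidean_distance(page_a, page_b))
    print(cosine_percent(page_a, page_b))
```

Under this assumed mapping, identical directions score 100%, orthogonal embeddings 50%, and opposite directions 0%, whereas Euclidean distance is unbounded and depends on vector magnitude.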
KEYWORDS:
graph embedding, HTML page structure, graph2vec, clustering
REFERENCES:
- Yamoun, L., Guessoum, Z., & Girard, C. (2022). Transformer RoBERTa vs. TF-IDF for websites content-based classification. CReSTIC EA 3804, University of Reims Champagne Ardenne, France; Efficient IP, La Garenne-Colombes, France; LIP6 UMR 7606, Sorbonne University, France.
- Li, X., Zhang, W., Wang, D., Zhang, B. & He, H. (2018). Algorithm of web page similarity comparison based on visual block. Computer Science and Information Systems, 16(3), pp. 815–830.
- Cruz, I. F., Borisov, S., Marks, M. A. & Webb, T. R. (1998). Measuring structural similarity among web documents: preliminary results. Proceedings of the 7th International Conference on Electronic Publishing, pp. 513–524. ICCC Press.
- Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y. & Jaiswal, S. (2017). Graph2vec: Learning distributed representations of graphs. Nanyang Technological University.
- Grover, A. & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. ACM Press, New York, NY, USA.
- Beautiful Soup. (n.d.). Beautiful Soup 4.12.0 documentation. Available at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ [Accessed 6 October 2024].
- Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press.
- MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press.
- Selenium WebDriver. (2023). Available at: https://www.selenium.dev/ [Accessed 6 October 2024].
- skimage. (n.d.). skimage 0.24.0 documentation. Available at: https://scikit-image.org/docs/stable/api/skimage.html [Accessed 6 October 2024].
- Jolliffe, I. T. (1986). Principal Component Analysis. Springer Series in Statistics. Springer-Verlag New York.
- Hubert, L. & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), pp. 193–218.