Mining Spam Links and Malicious Web Pages Based on the Topological Structure of Massive Internet Web Pages

Published: 2019-04-03 20:50

[Abstract]: The World Wide Web offers a vast amount of information that anyone can access. To find the most valuable information among so many web pages, Internet users rely mainly on search engines, which are the main way users find the content they need. A search engine typically classifies a large number of pages and, based on query relevance and page ranking, returns the pages most relevant to the user's query. Users usually visit the top-ranked pages and ignore the rest, so a high search-engine ranking is very important for any page that wants to attract more clicks. To return the most relevant and most popular pages for a query, a search engine assigns each page a rank according to certain algorithms, and that rank generally increases with the number and rank of the other sites linking to the page. Spam-link attackers, however, have developed several techniques to game these algorithms and raise the ranking of their own pages. These techniques are usually based on underground links used for collaborative link exchange: spam-link developers build a network of relationships among themselves to raise their pages' ranking in search results. This thesis studies how to identify search-engine spam links and spam pages over a massive set of Internet nodes and edges. It collects web pages and the hyperlinks between them, constructs an Internet topology, studies and analyzes the characteristics of the subgraph formed by spam links within the whole topology, and traces where those spam links point by means of expansion, thereby identifying spam pages on the Internet.
In this study we give a fairly comprehensive analysis and summary of the topological characteristics of spam pages and spam links, and predict the topological characteristics of spam links. Based on the classification of spam pages and these topological characteristics, we propose a model for mining spam links and malicious web pages from the topology of Internet web pages, and within that model a simple but efficient algorithm for collecting and expanding seed nodes. To build the seed set, some pages found in a link farm are taken as initial seeds; then, for each new page, if the page has multiple inbound links from the seed set and multiple outbound links to it, the page is very likely part of the same link farm, and the seed set is extended by adding it. Once the seed set is obtained, an expansion step is needed to find more bad pages in the data set before the spam-link topology can be built. The expansion step rests on the observation that a page pointing to many bad pages is very likely bad itself. Expansion therefore proceeds from a page to its linked pages, but along inbound links rather than outbound links. To evaluate the model's performance at mining spam pages, we used a Python crawler module to collect pages. The experimental data are divided into three groups by crawl time and total 95,000 pages located in 8,452 distinct domains. Of these, 6,208 pages are labeled as spam, and 180 seed nodes were obtained. Across the three groups of experimental data, the model achieves an overall accuracy of 83.3%, which largely meets the goal of detecting spam pages and link farms.
Moreover, the spam-link topology drawn from the experimental data is largely consistent with the topology predicted from the spam-link topological characteristics, which indicates that the thesis's conjecture about spam-link topology is essentially correct. Further, by following where these spam links point, the spam pages they serve can be found and then reported or publicized, which reduces the chance that those pages are exposed in search engines and helps maintain Internet security.
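The two steps the abstract describes (growing a seed set inside a link farm, then expanding to more bad pages along inbound links) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the toy graph, and the thresholds `min_links` and `min_bad_targets` are assumptions chosen for illustration, since the abstract only says "multiple" links and gives no concrete cut-offs.

```python
from collections import defaultdict


def build_inbound(outlinks):
    """Invert an adjacency map {page: set(targets)} into {page: set(sources)}."""
    inbound = defaultdict(set)
    for src, targets in outlinks.items():
        for dst in targets:
            inbound[dst].add(src)
    return inbound


def grow_seed_set(outlinks, seeds, min_links=2):
    """Add pages that both receive links from and send links to the seed set.

    min_links is a hypothetical threshold standing in for "multiple".
    """
    inbound = build_inbound(outlinks)
    seeds = set(seeds)
    changed = True
    while changed:  # repeat until no page qualifies any more
        changed = False
        # Candidates are pages adjacent to a current seed in either direction.
        candidates = set()
        for s in seeds:
            candidates |= outlinks.get(s, set()) | inbound.get(s, set())
        for page in candidates - seeds:
            links_from_seeds = len(inbound.get(page, set()) & seeds)
            links_to_seeds = len(outlinks.get(page, set()) & seeds)
            if links_from_seeds >= min_links and links_to_seeds >= min_links:
                seeds.add(page)
                changed = True
    return seeds


def expand_bad_pages(outlinks, bad, min_bad_targets=2):
    """Follow inbound links: flag pages that point to several known-bad pages.

    min_bad_targets is a hypothetical threshold standing in for "a pile of".
    """
    inbound = build_inbound(outlinks)
    bad = set(bad)
    frontier = set(bad)
    while frontier:
        next_frontier = set()
        for b in frontier:
            # Only pages that link TO a bad page are examined.
            for page in inbound.get(b, set()) - bad:
                if len(outlinks.get(page, set()) & bad) >= min_bad_targets:
                    bad.add(page)
                    next_frontier.add(page)
        frontier = next_frontier
    return bad


# Toy graph (made up for this example): three interlinked farm pages that
# promote "target", a "booster" page pointing into the farm, and two
# ordinary pages.
outlinks = {
    "farm1": {"farm2", "farm3", "target"},
    "farm2": {"farm1", "farm3", "target"},
    "farm3": {"farm1", "farm2", "target"},
    "booster": {"farm1", "farm2"},
    "blog": {"farm1", "news"},
    "news": {"blog"},
}

farm = grow_seed_set(outlinks, {"farm1", "farm2"})
bad = expand_bad_pages(outlinks, farm)
print(sorted(farm))  # ['farm1', 'farm2', 'farm3']
print(sorted(bad))   # ['booster', 'farm1', 'farm2', 'farm3']
```

In this toy graph, `blog` is not flagged because it links to only one bad page, and `target` is not flagged because the expansion follows inbound links only. Both thresholds would need tuning on real crawl data.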
[Degree-granting institution]: Jilin University (吉林大学)
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092


Article ID: 2453567


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2453567.html
