Design and Implementation of a Clustering Search System for Open Source Software
Topics: open source software + clustering search; Source: National University of Defense Technology, master's thesis, 2012
[Abstract]: Using open source software to improve the efficiency and quality of software development has become an important trend in software engineering. As open source software has developed rapidly and been widely adopted, a large number of open source communities for developing and sharing such software have appeared on the Internet. Today, open source software of many kinds and in very large numbers is scattered across these communities, which makes searching for and selecting open source software a serious challenge. How to automatically collect and retrieve the massive open source data held in these communities, cluster the retrieved results, and so offer users a cross-community clustering search service for open source software is a topic of considerable research and practical value.
This thesis analyzes search engine and clustering search technologies in depth. Based on how open source software data is distributed on the Internet and on the characteristics of that data, it designs Influx, an open source software search system that covers crawling of open source community data, attribute extraction and indexing, and clustering analysis of search results, and that effectively supports cross-community clustering search of open source software. The main work is as follows.
First, the thesis compares search engine and clustering search technologies and, addressing the particular requirements of searching open source communities, proposes Influx, a clustering search system architecture for open source software. The architecture divides such a system into four layers (data storage, data retrieval, data analysis, and data access) and has good extensibility.
Second, it designs the information retrieval and clustering analysis mechanisms of the system. An efficient mechanism for crawling open source software information, extracting information, and indexing attributes is built on the Heritrix and Lucene platforms, and an improved search result clustering mechanism is designed on the basis of the K-means algorithm so that users can browse search results selectively.
Finally, the Influx search system is implemented and tested experimentally to verify its functionality and performance. The experimental results show that Influx effectively supports Internet-wide, cross-community search for open source software and clustering analysis of the search results.
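The K-means-based clustering of search results mentioned in the abstract can be illustrated with a minimal sketch. It uses plain spherical k-means over TF-IDF vectors, not the improved variant designed in the thesis, whose details are not given in this abstract. The function names, the farthest-first seeding, and the sample result titles are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a unit-length TF-IDF vector (dict: term -> weight)."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so every weight stays positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Normalized sum of the member vectors (the spherical k-means centroid)."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, k, max_iter=20):
    """Cluster unit vectors by cosine similarity; returns one label per vector."""
    # Assumed seeding: farthest-first, which keeps the sketch deterministic.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


# Hypothetical search result titles, standing in for results from several communities.
results = [
    "Apache HTTP web server",
    "Nginx web server and reverse proxy",
    "MySQL relational database engine",
    "PostgreSQL relational database system",
    "Lighttpd lightweight web server",
    "SQLite embedded relational database",
]
labels = kmeans(tfidf_vectors(results), k=2)

# Group the titles by cluster label so they can be browsed cluster by cluster.
clusters = {}
for title, label in zip(results, labels):
    clusters.setdefault(label, []).append(title)
for label, titles in sorted(clusters.items()):
    print(label, titles)
```

In this toy input the web server titles and the database titles share no vocabulary, so the two groups separate cleanly. Real search results overlap far more, which is presumably where an improved clustering mechanism would matter.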
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP311.52