AdaTextTiling: An Adaptive Text Segmentation Technique Based on an Improved TextTiling Algorithm

Published: 2018-03-23 04:21

Topic: text segmentation. Focus: TextTiling algorithm. Source: East China Normal University, 2017 master's thesis. Type: degree thesis.

[Abstract]: As computers have become commonplace in daily life, information technology has developed rapidly across society and the Internet has become ever more information-rich. Through the Internet, people can easily access information from around the world and interact with people anywhere, which makes everyday life more convenient and efficient. This heavy use has also produced enormous volumes of Internet data with very high mining value. Text accounts for a very large share of that data, and text mining is the task of extracting valuable information from it. Text segmentation is an important branch of text mining and plays a significant role in extracting information from text. It treats a document as a sequence of sub-topic passages and divides the document into segments so that each segment corresponds to one sub-topic. Many text segmentation algorithms exist, and TextTiling is one of the classic ones. This thesis improves on the classic TextTiling algorithm and proposes AdaTextTiling, an algorithm with better segmentation performance. The thesis first analyzes TextTiling, covering its principles and its shortcomings, and then optimizes it. The main change is that the text window length is adjusted flexibly when computing the similarity between the text on either side of a potential segmentation point, because the thesis holds that the optimal window length is not the same fixed value for every potential segmentation point. The thesis also analyzes and optimizes the computational logic of the TextTiling implementation to improve its efficiency, and on that basis further optimizes the algorithm by combining it with the LDA topic model. Finally, experiments show that AdaTextTiling clearly outperforms TextTiling, which demonstrates the effectiveness of AdaTextTiling.
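The idea of scoring each potential segmentation point with its own window length can be illustrated with a minimal sketch. The abstract does not say how AdaTextTiling chooses the window, so the selection rule below (keep the candidate window that gives the lowest similarity) is only an assumption for illustration, as are the function names, the candidate window sizes, and the toy document. Input is assumed to be pre-tokenized sentences; Chinese text would first need a word segmenter.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_similarity(sentences, gap, window):
    """Similarity of the `window` sentences before and after position `gap`."""
    left = Counter(w for s in sentences[max(0, gap - window):gap] for w in s)
    right = Counter(w for s in sentences[gap:gap + window] for w in s)
    return cosine(left, right)


def adaptive_gap_scores(sentences, windows=(1, 2, 3)):
    """Score every gap between sentences with a per-gap window length.

    Assumption (not taken from the thesis): for each gap, keep the
    candidate window that yields the lowest similarity.
    """
    scores = {}
    for gap in range(1, len(sentences)):
        # Only windows that fit entirely on both sides of the gap.
        usable = [w for w in windows if w <= gap and gap + w <= len(sentences)] or [1]
        scores[gap] = min((gap_similarity(sentences, gap, w), w) for w in usable)
    return scores  # gap index -> (similarity, chosen window)


# Toy document: two sentences about cats, then two about markets.
doc = [
    "cats purr and cats sleep".split(),
    "cats chase mice".split(),
    "stocks fell as markets closed".split(),
    "markets and stocks recovered".split(),
]
scores = adaptive_gap_scores(doc)
boundary = min(scores, key=lambda g: scores[g][0])
print(boundary, scores[boundary])  # prints: 2 (0.0, 1)
```

In this toy case the gap between the second and third sentences gets the lowest score, and the window of one sentence wins there because the two-sentence window picks up the shared word "and". A fixed window of two would have rated that gap as more similar than it is, which is the kind of effect a per-gap window is meant to avoid.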
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1


Article ID: 1651881


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1651881.html
