
Temporal Keyword Query on Social Media Data

Published: 2018-08-05 19:33

[Abstract]: Social media services have become some of the most frequently used Internet services in daily life. They record the original content that users publish, together with reposts and comments. As this data accumulates, its long time span makes it valuable for studying collective user behavior and for building a complete picture of a person or an event. Because keyword queries are simple and easy to use, they are also used to retrieve relevant information from massive social media data. Users who follow a developing event repeatedly submit the same query to get the latest news, while analysts who want a thorough understanding of a subject need to collect data from different periods. Existing social media search services and research, however, focus mainly on real-time search, and the publication time recorded in each message is used only to measure how fresh the data is.

This thesis uses a social media data stream model to represent original content together with its reposts and comments, and defines a reference time series for every social object. On top of this model, a temporal keyword query uses keywords as the content constraint, takes the sum of the time series over the query time range as the input of the corresponding scoring function, and returns the k social objects with the highest scores. Time is thereby promoted to a constraint of the query itself, so that a single query type serves both real-time tracking and exploratory analysis. The thesis then proposes indexing techniques and query algorithms for this query from two angles: the characteristics of social media data that can be exploited under offline indexing, and the index update efficiency required under online indexing. Finally, it analyzes the changes in information diffusion behind the rise and decline of Sina Weibo on the basis of time series data, and builds an online Weibo analysis platform on a real-time social media data stream. Both serve as example applications of the temporal keyword query.

The thesis is organized around the temporal keyword query problem, and its main contributions are threefold:

· A two-tier index structure and a piecewise maximum approximate summary, both designed around the characteristics of social media data. On the one hand, the reference trees of social objects follow long-tail distributions in both size and lifetime. On the other hand, a social object usually stays popular only during certain periods and receives very little attention for the rest of its long lifetime. Based on these two characteristics, the thesis designs a two-tier inverted list structure and a piecewise maximum approximate summary. The two-tier inverted list structure manages popular objects and ordinary objects with different index structures. Both structures support filtering along the time dimension and return objects in descending order of final reference tree size. A theoretical analysis based on the long-tail distribution of reference tree sizes gives an upper bound on the amount of data that a query algorithm using this index must access. Statistical analysis on real data sets shows that in most cases this upper bound grows sublinearly with k. The piecewise maximum approximate summary estimates more accurately an upper bound on each object's reference tree size within the query window, which avoids the disk accesses otherwise needed to compute the exact score of a popular object that is not popular inside the query window.

· A log-structured octree index for real-time temporal keyword queries. Another characteristic of social media data is the high rate at which users generate it, which is especially pronounced during breaking events. In the online indexing setting, indexing this data quickly and reflecting it promptly in query results therefore matters both for the experience of ordinary users and for providing timely data to support fast decisions. The thesis approximates the reference time series of each social object by segments, maps the resulting segment data to points in three-dimensional space, and uses an octree to preserve the locality of the indexed social objects in both importance and time. The encoding of octree nodes lets the index support filtering along the time dimension while also guaranteeing the data return order required by the temporal threshold algorithm. Combining the octree with a log-structured merge-tree exploits both fast memory access and efficient sequential disk reads and writes, which makes fast indexing of social media data possible.

· Analysis applications over massive and real-time social media data, built on the temporal keyword query. Using the complete Weibo activity of a population of 1.7 million users over roughly five years, the thesis analyzes the changes in information diffusion behind the rise and decline of Sina Weibo. In this analysis the temporal keyword query is used to improve the accuracy of the data extraction rules, which helps cover the data more completely. Starting from a model of the repost time series of a single post, the thesis proposes fitting the repost model parameters of a group of posts with a log-Gaussian model and identifies a statistic related to the speed of information diffusion. It further defines various features of user behavior on the Sina Weibo platform, as well as external features that reflect the attitude of Internet users at large toward the various social platforms, analyzes their trends, and explores how they relate to the statistic that reflects information diffusion. Finally, the thesis assembles the techniques it develops into an online analysis platform over the real-time Sina Weibo data stream. The platform clusters the results of temporal keyword queries into topics and presents preliminary statistics for each topic along several dimensions.

In summary, this thesis extends the existing keyword query functionality on social media data by proposing the temporal keyword query, and explores index organization and query algorithms from two perspectives: the characteristics of social media data and the efficiency of index updates. The two analysis applications built on the query show that it adapts flexibly to a range of application scenarios, helps users uncover important information in social media data, and provides a data foundation for more complex analysis tasks. The publicly accessible system built at the end of the thesis implements its indexing and analysis techniques, so that researchers and analysts in many fields can benefit from massive, real-time social media data.
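The query defined in the abstract (keyword filter, score from the sum of the reference time series over the query window, top k) can be illustrated with a small in-memory sketch. This is not the thesis's index or algorithm: the function names, the fixed segment length, the use of the raw window sum as the score, and the on-the-fly computation of summaries and prefix sums are all assumptions made for illustration. The segment maxima stand in for the idea behind the piecewise maximum approximate summary: a cheap upper bound lets the scan skip exact scoring for objects that cannot enter the current top k.

```python
import heapq
from itertools import accumulate


def prefix_sums(series):
    """Return P with P[i] == sum(series[:i]); a window sum is P[hi] - P[lo]."""
    return [0] + list(accumulate(series))


def segment_max_summary(series, seg_len):
    """Coarse summary (assumed form): the maximum count in each fixed-length segment."""
    return [max(series[i:i + seg_len]) for i in range(0, len(series), seg_len)]


def window_upper_bound(summary, seg_len, lo, hi, n):
    """Upper bound on sum(series[lo:hi]) computed from segment maxima only."""
    bound = 0
    for s, seg_max in enumerate(summary):
        start, end = s * seg_len, min((s + 1) * seg_len, n)
        overlap = min(hi, end) - max(lo, start)
        if overlap > 0:
            bound += seg_max * overlap
    return bound


def temporal_topk(objects, keyword, lo, hi, k, seg_len=4):
    """Top-k objects containing `keyword`, scored by references in slots [lo, hi).

    objects maps an id to (set of keywords, list of per-slot reference counts).
    Returns (ranked list of (id, score), number of exact score evaluations).
    A real index would precompute summaries and prefix sums; here they are
    rebuilt per object to keep the sketch short.
    """
    if k <= 0:
        return [], 0
    heap = []  # min-heap of (score, id) holding the best k seen so far
    exact_evaluations = 0
    for oid, (words, series) in objects.items():
        if keyword not in words:
            continue
        n = len(series)
        if len(heap) == k:
            summary = segment_max_summary(series, seg_len)
            if window_upper_bound(summary, seg_len, lo, hi, n) <= heap[0][0]:
                continue  # cannot beat the current k-th best score
        ps = prefix_sums(series)
        score = ps[min(hi, n)] - ps[min(lo, n)]
        exact_evaluations += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, oid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, oid))
    ranked = sorted(heap, key=lambda item: item[0], reverse=True)
    return [(oid, score) for score, oid in ranked], exact_evaluations


if __name__ == "__main__":
    demo = {
        "a": ({"quake"}, [9, 8, 0, 0, 0, 0, 0, 0]),   # popular early, quiet later
        "b": ({"quake"}, [0, 0, 0, 0, 3, 4, 2, 1]),   # popular late
        "c": ({"quake"}, [1, 1, 1, 1, 1, 1, 1, 1]),   # steady trickle
        "d": ({"sports"}, [5, 5, 5, 5, 5, 5, 5, 5]),  # different keyword
    }
    print(temporal_topk(demo, "quake", 4, 8, 1))  # -> ([('b', 10)], 2)
```

In the demo, object "c" is never scored exactly for the window [4, 8): its bound of 4 is already below the best score of 10, so only two exact evaluations are needed for three matching objects.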
[Degree-granting institution]: East China Normal University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP391.3;TP393.09


Article No.: 2166788


Article link: http://sikaile.net/guanlilunwen/ydhl/2166788.html
