Information Extraction Based on E-commerce Data and User Behavior
Published: 2018-10-15 11:47
[Abstract]: With the explosive growth of the Internet and e-commerce in China, e-commerce companies led by Alibaba are generating massive amounts of data and attracting hundreds of millions of users. In other words, the era of big data is fast approaching. Faced with data at this scale, the central questions are how to improve data utilization and how to extract the information that users most want and that is most valuable. E-commerce sits at the forefront of the Internet industry, so the conversion from data to information must be completed especially quickly there. This is the information extraction problem studied in this thesis, with a particular focus on the e-commerce domain.

Existing information extraction techniques mainly comprise named entity recognition and relation extraction. Current approaches to named entity recognition include rule- and dictionary-based methods, statistical methods, and hybrids of the two. Rule- and dictionary-based methods achieve high precision once the rules have been tuned for a specific target, but they are labor-intensive and hard to reuse or extend, so they usually solve only particular application scenarios. Statistical methods often deliver unsatisfactory precision and recall and have higher algorithmic complexity, but they extend well and leave considerable room for improvement; many researchers work on refining the statistical models to raise precision and recall and thereby achieve genuinely automatic recognition. Classical named entity recognition models include the HMM (hidden Markov model), the ME-HMM (maximum-entropy hidden Markov model), and the CRF (conditional random field). Relation extraction analyzes large corpora to extract the relations between named entities, such as affiliation between place names and organization names, similarity between product names, and synonymy between abbreviations and full names.

Information extraction is also a highly applied field: an algorithm has to be implemented as a working system before the effectiveness of the model can be assessed accurately. However, popular information extraction systems, such as the OPENIE series of software packages whose development is led by the University of Washington, can only be applied to English. An efficient, usable Chinese information extraction system is urgently needed.

The main contributions of this thesis are:
1) It reviews the classical information extraction models: HMM, ME-HMM, and CRF for named entity recognition, and the word-vector model for near-synonym relation extraction. It also introduces the evaluation metrics commonly used for information extraction tasks: precision, recall, and F value.
2) It optimizes the classical hidden Markov model for e-commerce data and proposes a lexical hidden Markov model (Lexical-HMM) that improves named entity recognition precision in e-commerce scenarios. For near-synonym relation extraction, it proposes a bipartite graph model based on user search and browsing behavior that extracts near-synonym relations between entities efficiently and accurately; comparative experiments confirm the effectiveness of the algorithm.
3) It designs and validates the proposed information extraction system. Built on the Spark platform with a manually annotated training set and organized as a DAG, the system extracts a named entity library and a near-synonym library from the input data efficiently and accurately; its efficiency and stability are verified.
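The precision, recall, and F value mentioned in the abstract can be illustrated with a minimal sketch. It assumes that extracted entities are compared as exact-match (text, type) pairs and that the F value is the balanced F1 form; the function name `precision_recall_f1` and the sample entities are hypothetical and are not taken from the thesis.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    # Items that appear in both the system output and the gold annotation.
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F1: harmonic mean of precision and recall (an assumed choice).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (entity text, entity type) pairs, for illustration only.
gold = {("Alibaba", "ORG"), ("iPhone 6", "PRODUCT"),
        ("Hangzhou", "LOC"), ("Nike", "BRAND")}
predicted = {("Alibaba", "ORG"), ("iPhone 6", "PRODUCT"), ("Hangzhou", "ORG")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this example two of the three predicted entities are correct and two of the four gold entities are found, so precision is 2/3, recall is 1/2, and F1 is 4/7. The same function would apply to synonym pairs if each pair is stored in a canonical order.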
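[Degree-granting institution]: University of Electronic Science and Technology of China (电子科技大学)
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1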
Article ID: 2272451
Article link: http://sikaile.net/jingjilunwen/dianzishangwulunwen/2272451.html