Mining Software Repositories for Bug Localization: A Comparative Analysis of the Revised Vector Space Model and Pretrained Word Embeddings
Published: 2023-06-04 05:07
The field of mining software repositories analyzes the data stored in software repositories in order to support the software development process. Although version control systems, bug tracking systems, communication archives, design requirements, and documentation contain vast amounts of data, their highly unstructured nature still makes them challenging for researchers to use in analysis. One of the tasks that practitioners of mining software repositories attempt to solve is bug localization. Locating bugs in source code is difficult: manual bug localization is a notoriously tedious process, and developers spend a large share of their time on it. The goal of bug localization is to automatically identify defective source code files based on bug reports. Even with a large number of automated techniques, the field has not yet reached its full potential or achieved commercial adoption; automatic bug localization therefore remains an open problem in which the research community has shown great interest. With recent advances in natural language processing, many models for embedding words into vectors have been proposed. They are based on the distributional hypothesis, namely that similarity of word meaning is reflected by similarity in vector space. Such models allow us to measure the semantic similarity of words by examining the distances between their vector representations. This thesis investigates the effectiveness of pretrained word embedding models, combined with information retrieval models, for bug localization. Using different preprocessing techniques, the proposed model is evaluated on its ability to retrieve a ranked list of source code files relevant to the analyzed bug report. Bug localization can handle data of an unstructured nature, such as bug reports, ...
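The retrieval idea described in the abstract (embed the bug report and each source file, then rank the files by vector similarity) can be sketched as follows. This is a minimal toy illustration, not the thesis's actual pipeline: the three-dimensional "embeddings" and file token lists below are hypothetical stand-ins for the pretrained word2vec/fastText/GloVe vectors and parsed source files the thesis uses.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy 3-dimensional word vectors; real pretrained models use 100-300 dimensions.
embeddings = {
    "bug":   [0.9, 0.1, 0.0],
    "error": [0.8, 0.2, 0.1],
    "tree":  [0.1, 0.9, 0.3],
}

def doc_vector(tokens):
    """Represent a document as the average of its known token vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * 3
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def rank_files(report_tokens, files):
    """Rank source files by cosine similarity to the bug report's vector."""
    query = doc_vector(report_tokens)
    return sorted(files,
                  key=lambda f: cosine_similarity(query, doc_vector(files[f])),
                  reverse=True)

files = {
    "Parser.java": ["tree"],
    "Logger.java": ["error", "bug"],
}
print(rank_files(["bug", "error"], files))  # → ['Logger.java', 'Parser.java']
```

Averaging token vectors is the simplest way to embed a whole document; the thesis also considers Word Mover's Distance, which compares the token vectors of two documents directly rather than collapsing each document to a single vector.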
【文章頁(yè)數(shù)】:71 頁(yè)
【學(xué)位級(jí)別】:碩士
【文章目錄】:
Abstract
Abstract (in Chinese)
Chapter 1 Introduction
1.1 Introduction to Mining Software Repositories
1.2 Background research and objectives
1.2.1 Research Objectives and contribution of the thesis
1.2.2 Background research
1.2.3 Motivation
1.3 Literature Review and Analysis
Chapter 2 Theoretical Background
2.1 Version Control Systems
2.1.1 SourceForge
2.1.2 GitHub
2.2 Bug tracking systems
2.2.1 Bugzilla
2.3 Information retrieval
2.3.1 Common terminology
2.4 Commonly used IR models
2.4.1 Vector Space Model (VSM)
2.4.2 Revised Vector Space Model (rVSM)
2.4.3 Latent Semantic Indexing (LSI)
2.4.4 Probabilistic Latent Semantic Indexing (PLSI)
2.4.5 Latent Dirichlet Allocation
2.5 Word embeddings
2.5.1 Vector space model and statistical language model
2.5.2 Representing text with embeddings
2.5.3 Types of word embeddings
2.6 Abstract Syntax Trees
2.7 Summary
Chapter 3 Bridging the Lexical Gap
3.1 Pretrained Word Embedding Models
3.1.1 word2vec model trained on Stack Overflow posts
3.1.2 fastText model trained on Common Crawl
3.1.3 GloVe model trained on Common Crawl
3.1.4 fastText model trained on source code files
3.2 Types of similarity
3.2.1 Lexical similarity
3.2.2 Semantic similarity
3.3 Similarity measures
3.3.1 Cosine similarity
3.3.2 Word Mover's Distance
3.4 Objective Function and Optimization
3.4.1 Differential evolution
3.5 Structure of the model
3.6 Summary
Chapter 4 Experimental Setup and Results
4.1 Data collection
4.2 Parsing and preprocessing
4.2.1 Tokenization and linguistic preprocessing of tokens
4.3 Experiments with different preprocessing techniques
4.3.1 Embedding whole content of source files
4.3.2 Parsing ASTs of source code files
4.4 Experiments with different pretrained vectors
4.5 Evaluation
4.6 Results
4.6.1 fastText vectors trained on Common Crawl data
4.6.2 GloVe vectors trained on Common Crawl data
4.6.3 word2vec vectors trained on Stack Overflow data
4.7 Comparison with other models
4.7.1 Comparison with the base rVSM model
4.7.2 Comparison of the proposed model with BugLocator
4.8 Summary
Conclusion
References
Acknowledgements
Resume
本文編號(hào):3830746
【文章頁(yè)數(shù)】:71 頁(yè)
【學(xué)位級(jí)別】:碩士
【文章目錄】:
Abstract
摘要
Chapter1 Introduction
1.1 Introduction to Mining Software Repositories
1.2 Background research and objectives
1.2.1 Research Objectives and contribution of the thesis
1.2.2 Background research
1.2.3 Motivation
1.3 Literature Review and Analysis
Chapter2 Theoretical Background
2.1 Version Control Systems
2.1.1 Source Forge
2.1.2 Git Hub
2.2 Bug tracking systems
2.2.1 Bugzilla
2.3 Information retrieval
2.3.1 Common terminology
2.4 Commonly used IR models
2.4.1 Vector Space Model(VSM)
2.4.2 Revised Vector Space Model(r VSM)
2.4.3 Latent Semantic Indexing(LSI)
2.4.4 Probabilistic Latent Semantic Indexing(PLSI)
2.4.5 Latent Dirichlet Allocation
2.5 Word embeddings
2.5.1 Vector space model and statistical language model
2.5.2 Representing text with embeddings
2.5.3 Types of word embeddings
2.6 Abstract Syntax Trees
2.7 Summary
Chapter3 Bridging the Lexical Gap
3.1 Pretrained Word Embedding Models
3.1.1 word2Vec model trained on Stack Overflow posts
3.1.2 Fast Text model trained on Common Crawl
3.1.3 Glo Ve model trained on Common Crawl
3.1.4 fast Text model trained on source code files
3.2 Types of similarity
3.2.1 Lexical similarity
3.2.2 Semantic similarity
3.3 Similarity measures
3.3.1 Cosine similarity
3.3.2 Word Mover distance
3.4 Objective Function and Optimization
3.4.1 Differential evolution
3.5 Structure of the model
3.6 Summary
Chapter4 Experimental Setup And Results
4.1 Data collection
4.2 Parsing and preprocessing
4.2.1 Tokenization and linguistic preprocessing of tokens
4.3 Experiments with different preprocessing techniques
4.3.1 Embedding whole content of source files
4.3.2 Parsing ASTs of source code files
4.4 Experiments with different pretrained vectors
4.5 Evaluation
4.6 Results
4.6.1 fast Text vectors trained on Common Crawl data
4.6.2 Glo Ve vectors trained on Common Crawl data
4.6.3 Word2Vec vectors trained on Stack Overflow data
4.7 Comparison with other models
4.7.1 Comparison with the base r VSM model
4.7.2 Comparison of the proposed model with Bug Locator
4.8 Summary
Conclusion
References
Acknowledgements
Resume
本文編號(hào):3830746
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/3830746.html