Mining Software Repositories for Bug Localization: A Comparative Analysis of the Revised Vector Space Model and Pretrained Word Embeddings
Published: 2023-06-04 05:07
The field of mining software repositories analyzes the data stored in software repositories in order to support the software development process. Although version control systems, bug tracking systems, communication archives, design requirements, and documentation contain vast amounts of data, their highly unstructured nature still makes them challenging for researchers to use in analysis. One of the tasks that practitioners of mining software repositories attempt to solve is bug localization. Locating bugs in source code is difficult: manual bug localization is a notoriously tedious process, and developers spend a large share of their time on it. The goal of bug localization is to automatically identify defective source code files based on bug reports. Even with a large number of automated techniques, the field has not yet reached its full potential or achieved commercial adoption; automatic bug localization therefore remains an open problem in which the research community has shown great interest. With recent advances in natural language processing, many models for embedding words into vectors have been proposed. They are based on the distributional hypothesis, namely that similarity of word meaning is reflected by similarity in vector space. Such models allow us to measure the semantic similarity of words by examining the distances between their vector representations. This thesis investigates the effectiveness of pretrained word embedding models, combined with information retrieval models, for bug localization. Using different preprocessing techniques, the proposed model is evaluated on its ability to retrieve a ranked list of source code files relevant to the analyzed bug report. Bug localization can handle data of an unstructured nature, such as bug reports, ...
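The retrieval idea described in the abstract (embed the bug report and each source file, then rank the files by vector similarity) can be sketched as follows. This is a minimal toy illustration, not the thesis's actual pipeline: the three-dimensional "embeddings" and file token lists below are hypothetical stand-ins for the pretrained word2vec/fastText/GloVe vectors and parsed source files the thesis uses.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy 3-dimensional word vectors; real pretrained models use 100-300 dimensions.
embeddings = {
    "bug":   [0.9, 0.1, 0.0],
    "error": [0.8, 0.2, 0.1],
    "tree":  [0.1, 0.9, 0.3],
}

def doc_vector(tokens):
    """Represent a document as the average of its known token vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * 3
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def rank_files(report_tokens, files):
    """Rank source files by cosine similarity to the bug report's vector."""
    query = doc_vector(report_tokens)
    return sorted(files,
                  key=lambda f: cosine_similarity(query, doc_vector(files[f])),
                  reverse=True)

files = {
    "Parser.java": ["tree"],
    "Logger.java": ["error", "bug"],
}
print(rank_files(["bug", "error"], files))  # → ['Logger.java', 'Parser.java']
```

Averaging token vectors is the simplest way to embed a whole document; the thesis also considers Word Mover's Distance, which compares the token vectors of two documents directly rather than collapsing each document to a single vector.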
【文章頁(yè)數(shù)】:71 頁(yè)
【學(xué)位級(jí)別】:碩士
【文章目錄】:
Abstract
Abstract (in Chinese)
Chapter 1 Introduction
1.1 Introduction to Mining Software Repositories
1.2 Background research and objectives
1.2.1 Research Objectives and contribution of the thesis
1.2.2 Background research
1.2.3 Motivation
1.3 Literature Review and Analysis
Chapter 2 Theoretical Background
2.1 Version Control Systems
2.1.1 SourceForge
2.1.2 GitHub
2.2 Bug tracking systems
2.2.1 Bugzilla
2.3 Information retrieval
2.3.1 Common terminology
2.4 Commonly used IR models
2.4.1 Vector Space Model (VSM)
2.4.2 Revised Vector Space Model (rVSM)
2.4.3 Latent Semantic Indexing (LSI)
2.4.4 Probabilistic Latent Semantic Indexing (PLSI)
2.4.5 Latent Dirichlet Allocation
2.5 Word embeddings
2.5.1 Vector space model and statistical language model
2.5.2 Representing text with embeddings
2.5.3 Types of word embeddings
2.6 Abstract Syntax Trees
2.7 Summary
Chapter 3 Bridging the Lexical Gap
3.1 Pretrained Word Embedding Models
3.1.1 word2vec model trained on Stack Overflow posts
3.1.2 fastText model trained on Common Crawl
3.1.3 GloVe model trained on Common Crawl
3.1.4 fastText model trained on source code files
3.2 Types of similarity
3.2.1 Lexical similarity
3.2.2 Semantic similarity
3.3 Similarity measures
3.3.1 Cosine similarity
3.3.2 Word Mover's Distance
3.4 Objective Function and Optimization
3.4.1 Differential evolution
3.5 Structure of the model
3.6 Summary
Chapter 4 Experimental Setup and Results
4.1 Data collection
4.2 Parsing and preprocessing
4.2.1 Tokenization and linguistic preprocessing of tokens
4.3 Experiments with different preprocessing techniques
4.3.1 Embedding whole content of source files
4.3.2 Parsing ASTs of source code files
4.4 Experiments with different pretrained vectors
4.5 Evaluation
4.6 Results
4.6.1 fastText vectors trained on Common Crawl data
4.6.2 GloVe vectors trained on Common Crawl data
4.6.3 word2vec vectors trained on Stack Overflow data
4.7 Comparison with other models
4.7.1 Comparison with the base rVSM model
4.7.2 Comparison of the proposed model with BugLocator
4.8 Summary
Conclusion
References
Acknowledgements
Resume
本文編號(hào):3830746
【文章頁(yè)數(shù)】:71 頁(yè)
【學(xué)位級(jí)別】:碩士
【文章目錄】:
Abstract
摘要
Chapter1 Introduction
1.1 Introduction to Mining Software Repositories
1.2 Background research and objectives
1.2.1 Research Objectives and contribution of the thesis
1.2.2 Background research
1.2.3 Motivation
1.3 Literature Review and Analysis
Chapter2 Theoretical Background
2.1 Version Control Systems
2.1.1 Source Forge
2.1.2 Git Hub
2.2 Bug tracking systems
2.2.1 Bugzilla
2.3 Information retrieval
2.3.1 Common terminology
2.4 Commonly used IR models
2.4.1 Vector Space Model(VSM)
2.4.2 Revised Vector Space Model(r VSM)
2.4.3 Latent Semantic Indexing(LSI)
2.4.4 Probabilistic Latent Semantic Indexing(PLSI)
2.4.5 Latent Dirichlet Allocation
2.5 Word embeddings
2.5.1 Vector space model and statistical language model
2.5.2 Representing text with embeddings
2.5.3 Types of word embeddings
2.6 Abstract Syntax Trees
2.7 Summary
Chapter3 Bridging the Lexical Gap
3.1 Pretrained Word Embedding Models
3.1.1 word2Vec model trained on Stack Overflow posts
3.1.2 Fast Text model trained on Common Crawl
3.1.3 Glo Ve model trained on Common Crawl
3.1.4 fast Text model trained on source code files
3.2 Types of similarity
3.2.1 Lexical similarity
3.2.2 Semantic similarity
3.3 Similarity measures
3.3.1 Cosine similarity
3.3.2 Word Mover distance
3.4 Objective Function and Optimization
3.4.1 Differential evolution
3.5 Structure of the model
3.6 Summary
Chapter4 Experimental Setup And Results
4.1 Data collection
4.2 Parsing and preprocessing
4.2.1 Tokenization and linguistic preprocessing of tokens
4.3 Experiments with different preprocessing techniques
4.3.1 Embedding whole content of source files
4.3.2 Parsing ASTs of source code files
4.4 Experiments with different pretrained vectors
4.5 Evaluation
4.6 Results
4.6.1 fast Text vectors trained on Common Crawl data
4.6.2 Glo Ve vectors trained on Common Crawl data
4.6.3 Word2Vec vectors trained on Stack Overflow data
4.7 Comparison with other models
4.7.1 Comparison with the base r VSM model
4.7.2 Comparison of the proposed model with Bug Locator
4.8 Summary
Conclusion
References
Acknowledgements
Resume
本文編號(hào):3830746
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/3830746.html