2020-05-14
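When selecting doc*-related questions from Stack Overflow, the method in this paper is fairly rigorous and convincing, and worth borrowing. [Note: the paper also describes how to extract the relevant data from GitHub issues, PRs, and mailing lists; that is probably valuable too, but I have not read it closely.] https://github.com/REVEAL-ICSE19-DocIssues/ReplicationPackage
The original description reads as follows: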
Stack Overflow (SO). We mined from the official SO dump of June 2018 all discussions having a question labeled with a documentation-related tag. To determine these tags, we searched for all tags related to documentation and documentation tools in the SO tag page by using the keywords doc, documentation and documentor. The latter term is known to be part of the name of tools supporting software documentation. One author then inspected all the tags resulting from these three searches to identify the ones actually related to software documentation and/or documentation tools. During the inspection, the author read the tag name, the tag description and some of the questions in which the tag was used. This process resulted in the selection of 23 tags (e.g., code-documentation, phpdocumentor, design-documents) that were used to search for the related discussions in SO. The first 30 results (discussions) returned by the 23 searches were manually inspected to look for additional documentation-related tags missed in the first step. The process was iterated with the newly founded tags until no new tags were found in the top 30 results of the tag searches. This resulted in a total of 78 (23+55) documentation-related tags (available in our replication package [56]).
Next, we queried the SO dump to extract all discussions having a question with a non-negative score and tagged with one or more of the relevant 78 tags. We removed questions with a negative score to filter out irrelevant discussions. This process resulted in the selection of 28,792 discussions. For each of them, we kept the question, the two top-scored answers and the accepted answer (if any).
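Roughly, the steps are:
First, search the SO tag page https://stackoverflow.com/tags using keywords a person can come up with (presumably the most accurate and fitting ones). Then manually read the name and description of every tag in the search results. To further make sure each tag is understood correctly (my interpretation), some of the questions under each tag are also read manually. This detail is underspecified: how were the "some questions" chosen, most relevant, most voted, or random? I lean toward most relevant.
This keyword search yields an initial set of tags (23), and what follows is essentially snowballing. First use these tags to search SO questions (I assumed one combined query, [tag1] or [tag2] or [xx] or [tag23], although "the 23 searches" in the quoted text suggests one search per tag), manually look at the tags of the top 30 questions, pick out those related to software documentation, add them to the existing tags, and search again. Iterate until the top 30 results contain no new suitable software documentation tags. The final result is 78 (23+55) tags.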
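The snowballing loop can be sketched in Python as below. This is only an illustration under my own assumptions, not the authors' tooling: `search_top_questions` (returns the tag lists of the top results for one tag) and `is_relevant` (stands in for the manual judgment) are hypothetical callables, and the toy index is made-up data. It models one search per tag rather than a combined query.

```python
from typing import Callable, Dict, Iterable, List, Set


def snowball_tags(
    seed_tags: Iterable[str],
    search_top_questions: Callable[[str], List[List[str]]],
    is_relevant: Callable[[str], bool],
) -> Set[str]:
    """Grow a tag set until the top results surface no new relevant tags."""
    accepted: Set[str] = set(seed_tags)
    rejected: Set[str] = set()            # tags already judged irrelevant
    frontier: List[str] = list(accepted)  # tags whose top results are not yet inspected
    while frontier:
        tag = frontier.pop()
        for question_tags in search_top_questions(tag):
            for candidate in question_tags:
                if candidate in accepted or candidate in rejected:
                    continue  # each tag is judged only once
                if is_relevant(candidate):
                    accepted.add(candidate)
                    frontier.append(candidate)  # newly found tag gets its own search
                else:
                    rejected.add(candidate)
    return accepted


# Toy data (hypothetical): each tag maps to the tag lists of its top questions.
toy_index: Dict[str, List[List[str]]] = {
    "code-documentation": [["code-documentation", "javadoc"], ["code-documentation", "java"]],
    "javadoc": [["javadoc", "doclet"], ["javadoc", "java"]],
    "doclet": [["doclet", "javadoc"]],
}


def toy_search(tag: str) -> List[List[str]]:
    return toy_index.get(tag, [])[:30]  # top 30 results, as in the paper


manually_relevant = {"javadoc", "doclet"}  # stand-in for the human decision
result = snowball_tags(["code-documentation"], toy_search, lambda t: t in manually_relevant)
```

The loop stops when the frontier is empty, which corresponds to "no new tags were found in the top 30 results".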
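These 78 tags are then used to search for questions, which raises the issue of filtering questions. Questions with a negative score (number of upvotes - number of downvotes) are filtered out (treated as irrelevant discussions). Then, for the paper's purposes, each question with a non-negative score is kept along with its two top-scored answers and its accepted answer (the green check mark, which not every question has).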
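A minimal Python sketch of this filtering step follows. The `Question` and `Answer` records and their field names are my own simplification, not the schema of the SO dump, and the sample data is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Answer:
    answer_id: int
    score: int


@dataclass
class Question:
    question_id: int
    score: int
    tags: List[str]
    accepted_answer_id: Optional[int] = None
    answers: List[Answer] = field(default_factory=list)


def select_discussions(
    questions: List[Question], doc_tags: Set[str]
) -> List[Tuple[Question, List[Answer]]]:
    """Keep non-negative-score questions carrying at least one documentation tag."""
    selected = []
    for q in questions:
        if q.score < 0:
            continue  # negative score: treated as an irrelevant discussion
        if not doc_tags.intersection(q.tags):
            continue  # none of the relevant tags
        # Two top-scored answers, highest first.
        kept = sorted(q.answers, key=lambda a: a.score, reverse=True)[:2]
        kept_ids = {a.answer_id for a in kept}
        # Add the accepted answer, if any, unless it is already among the top two.
        for a in q.answers:
            if a.answer_id == q.accepted_answer_id and a.answer_id not in kept_ids:
                kept.append(a)
        selected.append((q, kept))
    return selected


sample = [
    Question(1, 3, ["javadoc"], accepted_answer_id=3,
             answers=[Answer(1, 5), Answer(2, 1), Answer(3, 0)]),
    Question(2, -1, ["javadoc"]),        # dropped: negative score
    Question(3, 0, ["python"]),          # dropped: no documentation tag
    Question(4, 0, ["phpdocumentor"]),   # kept: score 0 is non-negative
]
chosen = select_discussions(sample, {"javadoc", "phpdocumentor"})
```

Note that a question with score 0 passes the filter, which matches the paper's "non-negative" wording.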
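Beyond the paper:
On finding tags, could another approach work like this:
Take all SO tags (many tags have no description or no questions and can simply be filtered out), manually check the name and description of every tag to pick out the relevant ones, then check some questions for each, and finalize the tags. In theory, wouldn't this be more sound?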
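The pre-filtering part of this idea might look like the Python sketch below. `TagRecord` and its fields are hypothetical, the records are placeholders, and ordering the review queue by question count is my own addition so that the most-used tags are checked first.

```python
from typing import List, NamedTuple


class TagRecord(NamedTuple):
    name: str
    description: str
    question_count: int


def review_queue(all_tags: List[TagRecord]) -> List[TagRecord]:
    """Drop tags with no description or no questions; most-used tags come first."""
    usable = [t for t in all_tags if t.description.strip() and t.question_count > 0]
    return sorted(usable, key=lambda t: t.question_count, reverse=True)


# Placeholder records, not real SO data.
tags = [
    TagRecord("tag-a", "placeholder description", 12),
    TagRecord("tag-b", "", 40),                          # no description: dropped
    TagRecord("tag-c", "placeholder description", 0),    # no questions: dropped
    TagRecord("tag-d", "placeholder description", 95),
]
queue = review_queue(tags)
```

Everything left in `queue` would still need the manual check of name, description, and sample questions described above.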