For study reference only. Chinese translations of these abstracts were produced with Baidu Translate.
Learning From Crowds (weakly supervised learning)
Abstract: For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, there is a substantial amount of disagreement among the annotators, and hence it is of great practical interest to address conventional supervised learning problems in this scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline.
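To make the contrast between majority voting and a probabilistic treatment concrete, here is a minimal sketch. It is not the paper's algorithm: the abstract does not spell out the model, so the code assumes a binary task, no input features, and one sensitivity and one specificity per annotator, fitted with a plain EM loop. The function names `majority_vote` and `em_annotator_model` are hypothetical.

```python
import numpy as np

def majority_vote(votes):
    """votes: (n_items, n_annotators) array of 0/1 labels."""
    votes = np.asarray(votes, dtype=float)
    return (votes.mean(axis=1) >= 0.5).astype(int)

def em_annotator_model(votes, n_iter=50, eps=1e-6):
    """Assumed simplification: estimate P(hidden label = 1) per item together
    with each annotator's sensitivity and specificity."""
    votes = np.asarray(votes, dtype=float)
    # Soft initial guess: the fraction of positive votes per item.
    mu = np.clip(votes.mean(axis=1), eps, 1 - eps)
    for _ in range(n_iter):
        # M-step: re-estimate the class prior and annotator reliabilities.
        prior = mu.mean()
        sens = np.clip(mu @ votes / mu.sum(), eps, 1 - eps)
        spec = np.clip((1 - mu) @ (1 - votes) / (1 - mu).sum(), eps, 1 - eps)
        # E-step: posterior of the hidden label given all votes on an item.
        like_pos = prior * np.prod(sens ** votes * (1 - sens) ** (1 - votes), axis=1)
        like_neg = (1 - prior) * np.prod(spec ** (1 - votes) * (1 - spec) ** votes, axis=1)
        mu = np.clip(like_pos / (like_pos + like_neg), eps, 1 - eps)
    return mu, sens, spec
```

Calling `em_annotator_model(votes)` returns soft label estimates plus a reliability score per annotator, so an unreliable annotator is down-weighted instead of counting as one full vote.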
Learning from ambiguously labeled images (multiple-instance learning)
Abstract: In many image and video collections, we have access only to partially labeled data. For example, personal photo collections often contain several faces per image and a caption that only specifies who is in the picture, but not which name matches which face. Similarly, movie screenplays can tell us who is in the scene, but not when and where they are on the screen. We formulate the learning problem in this setting as partially-supervised multiclass classification where each instance is labeled ambiguously with more than one label. We show theoretically that effective learning is possible under reasonable assumptions even when all the data is weakly labeled. Motivated by the analysis, we propose a general convex learning formulation based on minimization of a surrogate loss appropriate for the ambiguous label setting. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies. We experiment on a very large dataset consisting of 100 hours of video, and in particular achieve 6% error for character naming on 16 episodes of LOST.
Learning from Ambiguously Labeled Face Images (face recognition)
Abstract: Learning a classifier from ambiguously labeled face images is challenging since training images are not always explicitly labeled. For instance, face images of two persons in a news photo are not explicitly labeled by their names in the caption. We propose a Matrix Completion for Ambiguity Resolution (MCar) method for predicting the actual labels from ambiguously labeled images. This step is followed by learning a standard supervised classifier from the disambiguated labels to classify new images. To prevent the majority labels from dominating the result of MCar, we generalize MCar to a weighted MCar (WMCar) that handles label imbalance. Since WMCar outputs a soft labeling vector of reduced ambiguity for each instance, we can iteratively refine it by feeding it as the input to WMCar. Nevertheless, such an iterative implementation can be affected by the noisy soft labeling vectors, and thus the performance may degrade. Our proposed Iterative Candidate Elimination (ICE) procedure makes the iterative ambiguity resolution possible by gradually eliminating a portion of the least likely candidates in ambiguously labeled faces. We further extend MCar to incorporate the labeling constraints between instances when such prior knowledge is available. Compared to existing methods, our approach demonstrates improvement on several ambiguously labeled datasets.
Cross-lingual part-of-speech tagging through ambiguous learning
Abstract: When Part-of-Speech annotated data is scarce, e.g. for under-resourced languages, one can turn to cross-lingual transfer and crawled dictionaries to collect partially supervised data. We cast this problem in the framework of ambiguous learning and show how to learn an accurate history-based model. Experiments on ten languages show significant improvements over prior state-of-the-art performance.
Weakly-Supervised Classification of Pulmonary Nodules Based on Shape Characters (medical image analysis)
Abstract: Accurate classification and recognition of pulmonary nodules is an important and key step for Computer-Aided Diagnosis (CAD) systems in lung cancer diagnosis. Although it has become an increasingly popular research topic, many scientific and technical challenges remain. Not only do we lack accurate and effective recognition and classification algorithms, but we also have difficulties with shape feature representation and sample labeling. This paper therefore presents a weakly-supervised method based on the Partial Label Error-Correcting Output Codes (PL-ECOC) algorithm for solving the nodule classification problem. During the training phase, we use a small number of pulmonary nodules labeled by experts as weakly-supervised information to generate a binary classifier. This classifier is then used to compare Hamming distances for the testing samples in order to obtain the final category labels. Experiments on the Lung Imaging Database Consortium (LIDC) and real-world data sets demonstrate the effectiveness of our method.
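The Hamming-distance step mentioned in the abstract can be illustrated with a generic error-correcting output code decoder. This is a sketch under assumptions, not the PL-ECOC implementation: the codebook values, the 0/1 bit encoding, and the function name `hamming_decode` are all made up for illustration.

```python
import numpy as np

def hamming_decode(codebook, predicted_bits):
    """Return the class whose codeword is closest to the predicted bits.

    codebook: (n_classes, n_bits) array of 0/1 codewords, one row per class.
    predicted_bits: length n_bits vector of 0/1 outputs from the binary classifiers.
    """
    codebook = np.asarray(codebook)
    predicted_bits = np.asarray(predicted_bits)
    # Hamming distance = number of bit positions that disagree.
    distances = (codebook != predicted_bits).sum(axis=1)
    return int(np.argmin(distances)), distances
```

Because the decoder picks the nearest codeword, a test sample is still assigned to the intended class when a few of the binary outputs are wrong, as long as the codewords are far enough apart.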
Learning from ambiguously labeled examples (k-nearest neighbors)
Abstract: Inducing a classification function from a set of examples in the form of labeled instances is a standard problem in supervised machine learning. In this paper, we are concerned with ambiguous label classification (ALC), an extension of this setting in which several candidate labels may be assigned to a single example. By extending three concrete classification methods to the ALC setting and evaluating their performance on benchmark data sets, we show that appropriately designed learning algorithms can successfully exploit the information contained in ambiguously labeled examples. Our results indicate that the fundamental idea of the extended methods, namely to disambiguate the label information by means of the inductive bias underlying (heuristic) machine learning methods, works well in practice.
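One way a nearest-neighbor classifier could be extended to candidate label sets is sketched below. The abstract does not name the three methods or give their details, so the voting rule here (each neighbor casts one vote for every label in its candidate set) is an assumption, and `knn_ambiguous_predict` is a hypothetical name.

```python
from collections import Counter
import math

def knn_ambiguous_predict(train_points, train_label_sets, query, k=3):
    """Predict a label for `query` from training examples that each carry
    a set of candidate labels instead of a single label."""
    # Rank training examples by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter()
    for i in order[:k]:
        # Assumed rule: every candidate label of a neighbor gets one vote.
        votes.update(train_label_sets[i])
    # Highest vote count wins; ties are broken by label order for determinism.
    return min(votes, key=lambda label: (-votes[label], label))
```

Labels that are candidates only by accident tend to be spread across neighbors, while the true label recurs in the neighborhood, which is the disambiguation-by-inductive-bias idea the abstract describes.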
Ambiguously Labeled Learning Using Dictionaries
Abstract: We propose a dictionary-based learning method for ambiguously labeled multiclass classification, where each training sample has multiple labels and only one of them is the correct label. The dictionary learning problem is solved using an iterative alternating algorithm. At each iteration of the algorithm, two alternating steps are performed: 1) a confidence update and 2) a dictionary update. The confidence of each sample is defined as the probability distribution on its ambiguous labels. The dictionaries are updated using either soft or hard decision rules. Furthermore, using the kernel methods, we make the dictionary learning framework nonlinear based on the soft decision rule. Extensive evaluations on four unconstrained face recognition datasets demonstrate that the proposed method performs significantly better than state-of-the-art ambiguously labeled learning approaches.
A Regularization Approach for Instance-Based Superset Label Learning
Abstract: Different from the traditional supervised learning in which each training example has only one explicit label, superset label learning (SLL) refers to the problem that a training example can be associated with a set of candidate labels, and only one of them is correct. Existing SLL methods are either regularization-based or instance-based, and the latter has achieved state-of-the-art performance. This is because the latest instance-based methods contain an explicit disambiguation operation that accurately picks out the ground-truth label of each training example from its ambiguous candidate labels. However, such a disambiguation operation does not fully consider the mutually exclusive relationship among different candidate labels, so the disambiguated labels are usually generated in a nondiscriminative way, which is unfavorable for the instance-based methods to obtain satisfactory performance. To address this defect, we develop a novel regularization approach for instance-based superset label (RegISL) learning so that our instance-based method also inherits the good discriminative ability possessed by the regularization scheme. Specifically, we employ a graph to represent the training set, and require the examples that are adjacent on the graph to obtain similar labels. More importantly, a discrimination term is proposed to enlarge the gap of values between possible labels and unlikely labels for every training example. As a result, the intrinsic constraints among different candidate labels are deployed, and the disambiguated labels generated by RegISL are more discriminative and accurate than those output by existing instance-based algorithms. The experimental results on various tasks convincingly demonstrate the superiority of our RegISL to other typical SLL methods in terms of both training accuracy and test accuracy.
Maximum margin partial label learning
Abstract: Partial label learning deals with the problem that each training example is associated with a set of candidate labels, and only one among the set is the ground-truth label. The basic strategy to learn from partial label examples is disambiguation, i.e. by trying to recover the ground-truth labeling information from the candidate label set. As one of the major machine learning techniques, maximum margin criterion has been employed to solve the partial label learning problem. Therein, disambiguation is performed by optimizing the margin between the maximum modeling output from candidate labels and that from non-candidate labels. However, in this formulation the margin between the ground-truth label and other candidate labels is not differentiated. In this paper, a new maximum margin formulation for partial label learning is proposed which aims to directly maximize the margin between the ground-truth label and all other labels. Specifically, an alternating optimization procedure is utilized to coordinate ground-truth label identification and margin maximization. Extensive experiments show that the derived partial label learning approach achieves competitive performance against other state-of-the-art comparing approaches.
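The two margins contrasted in this abstract can be written side by side. The notation is an assumption introduced here for illustration and is not taken from the paper: F(x, y) is the model output for label y, S_i is the candidate label set of training example x_i, and y_i is its unknown ground-truth label.

```latex
% Assumed notation: F(x, y) model output, S_i candidate set, y_i true label.
\[
\text{existing margin:}\quad
\max_{y \in S_i} F(x_i, y) \;-\; \max_{y \notin S_i} F(x_i, y)
\]
\[
\text{proposed margin:}\quad
F(x_i, y_i) \;-\; \max_{y \neq y_i} F(x_i, y)
\]
```

The first expression never compares the ground-truth label with the other candidates in S_i. The second does, but it depends on y_i, which is unknown, and this is why the abstract pairs margin maximization with a ground-truth identification step in an alternating procedure.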
Classification with partial labels
Abstract: In this paper, we address the problem of learning when some cases are fully labeled while other cases are only partially labeled, in the form of partial labels. Partial labels are represented as a set of possible labels for each training example, one of which is the correct label. We introduce a discriminative learning approach that incorporates partial label information into the conventional margin-based learning framework. The partial label learning problem is formulated as a convex quadratic optimization minimizing the L2-norm regularized empirical risk using hinge loss. We also present an efficient algorithm for classification in the presence of partial labels. Experiments with different data sets show that partial label information improves the performance of classification when there is traditional fully-labeled data, and also yields reasonable performance in the absence of any fully labeled data.