UbiComp 2019 — EduSense: Practical Classroom Sensing at Scale


Author: Hoyer | Published 2022-05-07 20:55

    0. Keywords

    Classroom, Sensing, Teacher, Instructor, Pedagogy, Computer Vision, Audio, Speech Detection, Machine Learning

    1. Links

    This paper comes from the Human-Computer Interaction Institute at Carnegie Mellon University (CMU); the main work was done by the first author, PhD student Karan Ahuja, whose personal website lists other related work.

    Paper link: https://dl.acm.org/doi/abs/10.1145/3351229

    Paper/system homepage: https://theedusense.io/ or https://www.edusense.io/

    Paper code: https://github.com/edusense/edusense or https://github.com/edusense/ClassroomDigitialTwins

    My public mind map: https://www.processon.com/mindmap/602fa53ce401fd48f2ae17e6

    The paper states that EduSense "represents the first real-time, in-the-wild evaluated and practically-deployable classroom sensing system at scale that produces a plethora of theoretically-motivated visual and audio features correlated with effective instruction." Editor's note: our lab actually began similar work as early as 2016, though along a different technical route; unfortunately we never published a comparable systems paper.

    Top row: Example classroom scenes processed by EduSense. Bottom row: Featurized data, including body and face keypoints, with icons for hand raise, upper body pose, smile, mouth open, and sit/stand classification.

    2. Overview of Main Content

    ※ Abstract

    1) High-quality opportunities for professional development of university teachers need classroom data.

    2) Currently, there is no effective mechanism to give personalized formative feedback except manually.

    3) This paper presents the culmination of two years of research: EduSense (with visual and audio features)

    4) EduSense is the first to unify previously isolated features into a cohesive, real-time, and practically-deployable system

    ※ Introduction

    > Increasing student engagement and participation in class has been shown to effectively improve learning outcomes;

    > Compared with K-12 teachers, university instructors are generally only domain experts and are not trained in how to teach

    > Regular feedback on teaching practice matters for instructors who want to improve; pedagogical skill is not easy to acquire

    > acquiring regular, accurate data on teaching practice is currently not scalable

    > Today's teaching feedback relies heavily on professional human observers, which is very expensive

    Table 1. Features in EduSense

    > EduSense captures a wide variety of classroom facets shown to be actionable in the learning science literature, at a scale and temporal fidelity many orders of magnitude beyond what a traditional human observer in a classroom can achieve.

    > EduSense captures both audio and video streams using low-cost commodity hardware that views both the instructor and students

    > Detection: hand raises, body pose, body accelerometry, and speech acts. Table 1 gives the details.

    > EduSense is the first system to integrate the many previously separate classroom-scene features into one system

    > EduSense aims to accomplish two things:

        1) provide instructors with pedagogically relevant classroom data they can use to practice and grow

        2) serve as an extensible open platform

    ※ Related Systems

    > There is an extensive learning science literature on methods to improve instruction through training and feedback. [15] [26] [27] [32] [37] [38] [77] [78] (PS: the cited papers appear to be almost entirely from CMU)


    2.1 Instrumented Classrooms

    > Collect data about students in class with sensors (e.g., pressure sensors [2][58]), or instrument the physical structure of the classroom.

        ● adding computing to the tabletop (e.g., buttons, touchscreens, etc.) or with response systems like "clickers" [1][12][20][21][68]

        ● low-cost printed responses using color markers [25], QR Codes [17] or ARTags [57]

    > Use wearables to directly collect precise signals from students or teachers

        ● Affectiva’s wrist-worn Q sensor [62] senses the wearer’s skin conductance, temperature and motion (via accelerometers)

        ● EngageMeter [32] used electroencephalography headsets to detect shifts in student engagement, alertness, and workload

        ● Instrument just the teacher, with e.g., microphones [19].

    > Drawback: these approaches carry a social, aesthetic and practical cost.


    2.2 Non-Invasive Class Sensing

    > The goal is to maximize value while keeping instrumentation as unobtrusive as possible. Among the many non-invasive sensors, acoustic and visual sensing are all but indispensable for classroom sensing

    Speech

        ● [19] used an omnidirectional room microphone and head-mounted teacher microphone to automatically segment teacher and student speech events, as well as intervals of silence (such as after teacher questions).

        ● AwareMe [11], Presentation Sensei [46] and RoboCOP [75] (oral presentation practice systems) compute speech quality metrics, including pitch variety, pauses and fillers, and speaking rate.

    Cameras and computer vision

        ● Early systems, such as [23], targeted coarse tracking of people in the classroom, in this case using background subtraction and color histograms.

        ● Movement of students has also been tracked with optical flow algorithms, as was demonstrated in [54][63]

        ● Computer vision has also been applied to automatic detection of hand raises, including classic methods such as skin tone and edge detection [41], as well as newer deep learning techniques [51] (our lab's paper, linjiaojiao's hand-raise detection)

    Face Detection

        ● It can not only be used to find and count students, but also estimate their head orientation, coarsely signaling their area of focus [63][73][80].

        ● Facial landmarks can offer a wealth of information about students' affective state, such as engagement [76] and frustration [6][31][43], as well as detection of off-task behavior [7]

        ● The Computer Expression Recognition Toolbox (CERT) [52] is most widely used in these educational technology applications, though it is limited to videos of single students.


    2.3 System Contribution

    > As usual, the paper first takes the classroom sensing systems above to task:

        1) Each published isolated metrics on its own, and none has been tested and validated in real, large-scale classroom settings

        2) Each pairs one server with one classroom and cannot be rolled out campus-wide

        3) Few of these systems target teaching and pedagogy, so they do not bring the recent wave of breakthroughs in computer vision and deep learning to bear on complex classroom scenes

    Thus, we believe EduSense is unique in putting together disparate advances from several fields into a comprehensive and scalable system, paired with a holistic evaluation combining both controlled studies and months-long, real-world deployments.


    ※ EduSense System

    System Architecture.  Four key layers: Classrooms layer, Processing layer, Datastore layer, and Apps layer

    3.1 Sensing

    > Early system: depth cameras

    > Current system: Lorex LNE8950AB cameras offer a 112° field of view and feature an integrated microphone, costing around $150 at single-unit retail prices. They capture 3840x2160 (4K) video at 15 FPS with 16 kHz mono audio.

    Hardware used (figure)

    3.2 Compute

    > Early system:

        ● small Intel NUCs. However, this hardware approach was expensive to scale, deploy and maintain

        ● The earlier version was a large, monolithic C++ application. It was prone to software-engineering problems such as dependency conflicts and overload when new modules were added, and remote deployment of the software was an equal headache.

        ● Moreover, that C++ code was hard to combine with Python, the language most widely used in computer vision; even when forced together, integration was time-consuming and highly unstable. Because its components were not isolated from one another, the old system was also prone to errors and crashes.

    > Current system:

        ● The new system uses more stable IP cameras paired with a centrally located campus server; audio and video are streamed between them in real time over RTSP, forming the new system architecture (a minimal stream-reading sketch follows this list).

        ● The custom GPU-equipped EduSense server has 28 physical cores (56 cores with SMT), 196GB of RAM and nine NVIDIA 1080Ti GPUs

        ● The new system uses Docker containers (container-based virtualization) to isolate each module and run it independently; Docker's advantages need no elaboration here.
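
    As a concrete illustration of the camera-to-server path, here is a minimal Python sketch of sampling frames from an IP camera's RTSP stream with OpenCV. The URL, credentials and the two-second sampling interval are placeholder assumptions, not EduSense's actual configuration.

```python
# Minimal sketch: read frames from an IP camera over RTSP with OpenCV.
# The URL/credentials and the 2-second sampling interval are hypothetical,
# not taken from the EduSense codebase.
import cv2

RTSP_URL = "rtsp://user:password@192.168.1.50:554/stream1"  # placeholder

def sample_frames(url: str, every_n_seconds: float = 2.0):
    cap = cv2.VideoCapture(url)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open RTSP stream: {url}")
    fps = cap.get(cv2.CAP_PROP_FPS) or 15.0      # fall back if FPS is unreported
    step = max(1, int(round(fps * every_n_seconds)))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                # stream dropped; caller can reconnect
            break
        if idx % step == 0:
            yield frame                           # hand the frame to downstream modules
        idx += 1
    cap.release()

if __name__ == "__main__":
    for frame in sample_frames(RTSP_URL):
        print("got frame", frame.shape)           # e.g. (2160, 3840, 3) for 4K
```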


    3.3 Scene Parsing (Techniques)

    > Multi-person body keypoint (joints) detection: OpenPose (tested and tuned OpenPose parameters)

    > Difficult environment: high, wall-mounted (i.e., non-frontal) and slightly fish-eyed view.

    > Algorithm: additional logic to reduce false positive bodies (e.g., bodies too large or small); interframe persistent person IDs with hysteresis (tracking) using a combination of Euclidean distance and body inter-keypoint distance matching (a simplified matching sketch follows this block)

    > Speech: predict only silence and speech (Laput et al. [48]) + an adaptive background noise filter
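
    The paper names only the matching criteria (Euclidean distance plus inter-keypoint distance, with hysteresis). Below is a simplified sketch of how persistent IDs could be assigned via greedy nearest-neighbour matching on keypoints; the distance metric, threshold and greedy strategy are my assumptions, not the authors' implementation.

```python
# Sketch: persistent person IDs via greedy nearest-neighbour matching on keypoints.
# The distance metric (mean per-keypoint Euclidean distance) and the threshold are
# illustrative assumptions; EduSense's actual hysteresis logic is more involved.
import numpy as np

MATCH_THRESHOLD = 80.0   # pixels; hypothetical value
_next_id = 0
_tracks = {}             # person_id -> last-seen keypoints, shape (K, 2)

def body_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean Euclidean distance over keypoints detected in both bodies."""
    valid = ~(np.isnan(a).any(axis=1) | np.isnan(b).any(axis=1))
    if not valid.any():
        return np.inf
    return float(np.linalg.norm(a[valid] - b[valid], axis=1).mean())

def assign_ids(bodies: list[np.ndarray]) -> list[int]:
    """Greedily match this frame's bodies to existing tracks, else start new IDs."""
    global _next_id
    ids, used = [], set()
    for kp in bodies:
        cands = [(body_distance(kp, prev), pid)
                 for pid, prev in _tracks.items() if pid not in used]
        cands.sort()
        if cands and cands[0][0] < MATCH_THRESHOLD:
            pid = cands[0][1]
        else:
            pid, _next_id = _next_id, _next_id + 1
        used.add(pid)
        _tracks[pid] = kp
        ids.append(pid)
    return ids
```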

    Fig. 3. Processing pipeline. Video and audio from classroom cameras first flows into a scene parsing layer, before being featurized by a series of specialized modules. See also Figure 1 and Table 1.

    3.4 Featurization Modules

    > See Figures 1 and 3: the featurization modules take the scene-parsing results and turn them into classroom-level indicators that can be visualized, consumed by applications, or inspected when debugging

    > For details: see the open-source code repository (http://www.EduSense.io).

        ● Sit vs. Stand Detection: relative geometry of body keypoints (neck (1), hips (2), knees (2), and feet (2)) + MLP classifier (see the sketch after this list)

        ● Hand Raise Detection: eight body keypoints per body (neck (1), chest (1), shoulders (2), elbows (2), and wrists (2)) + MLP classifier

        ● Upper Body Pose: eight body keypoints + multiclass MLP model (predicting arms at rest, arms closed (e.g., crossed), and hands on face; see Figure 5)

        ● Smile Detection: ten mouth landmarks on the outer lip and ten landmarks on the inner lip + SVM for binary classification

        ● Mouth Open Detection: (a potential future way to identify speakers) two features from [71] (left and right / mouth_width) + binary SVM

        ● Head Orientation & Class Gaze: perspective-n-point algorithm [50] + anthropometric face data [53] + OpenCV's calib3d module [8]

        ● Body Position & Classroom Topology: using the face keypoints and camera calibration above, estimate each student's position and project it into a synthesized top-down view (PS: similar to the student localization in our system, though coarser here: no row/column detection and no matching of positions to student behaviors)

        ● Synthetic Accelerometer: simply track the motion of bodies across frames + 3D head position + delta X/Y/Z normalized by the elapsed time

        ● Student vs. Instructor Speech: a sound/speech detector using 1) the RMS of the student-facing camera's microphone (closest to the instructor), 2) the RMS of the instructor-facing camera's microphone (closest to the students), and 3) the ratio between those two values + random forest classifier (the goal is to tell whether the current speech comes from a student or the instructor)

        ● Speech Act Delimiting: uses the per-frame speech detection results (PS: presumably to segment distinct speech episodes?)
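
    The paper describes sit/stand as an MLP over the relative geometry of the neck, hip, knee and foot keypoints. Below is a hedged sketch of that recipe using scikit-learn; the normalization (keypoints expressed relative to the neck and scaled by torso length) and the MLP settings are my own assumptions, not the published model.

```python
# Sketch: sit vs. stand classification from 2D body keypoints with an MLP.
# The normalization (neck-relative, torso-scaled) and the MLP settings are
# illustrative assumptions; EduSense's exact features/classifier may differ.
import numpy as np
from sklearn.neural_network import MLPClassifier

# keypoints: dict of name -> (x, y), e.g. taken from OpenPose output
KEYS = ["neck", "hip_l", "hip_r", "knee_l", "knee_r", "foot_l", "foot_r"]

def featurize(kp: dict) -> np.ndarray:
    neck = np.array(kp["neck"], dtype=float)
    hip_mid = (np.array(kp["hip_l"], float) + np.array(kp["hip_r"], float)) / 2
    torso = np.linalg.norm(hip_mid - neck) + 1e-6       # scale factor
    feats = []
    for name in KEYS[1:]:                               # express joints relative to the neck
        rel = (np.array(kp[name], float) - neck) / torso
        feats.extend(rel)
    return np.asarray(feats)                            # 12-dimensional feature vector

# X: (n_samples, 12) feature matrix, y: 0 = sit, 1 = stand (labels from annotation)
def train(X: np.ndarray, y: np.ndarray) -> MLPClassifier:
    clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
    clf.fit(X, y)
    return clf
```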

    Fig. 5. Example participant from our controlled study. EduSense recognizes three upper body poses (left three images) and various hand raises (right four images). Live classifications from our upper body pose (orange text) and hand classifiers (yellow text) are shown.

    3.5 Training Data Capture

    > First, implementing these indicators requires a large amount of labeled training data, which raises two problems:

        1) Many annotators must be recruited to label events such as hand raises

        2) Diverse data from multiple viewpoints must be collected, so the authors had to set up their own capture hardware and scenes

    Fig. 6. Left: Training data capture rig in an example classroom. Right: Closeup of center mast, with six cameras.

    3.6 Datastore

    1) Non-image classroom data (ASCII JSON): about 250 MB for one class lasting around 80 minutes with 25 students

    2) Infilled data (full class video): about 16 GB per class at 15 FPS with every frame in 4K, for both front and back cameras (a quick size check follows this list)

    3) A web interface (Go app) and MongoDB form the backend server, plus a REST API over Transport Layer Security (TLS) (a different technical route and different details from ours)

    4) Frames are not saved long-term, to mitigate obvious privacy concerns (the data is simply deleted rather than retained)

    5) secure Network Attached Storage (NAS)
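
    A quick back-of-the-envelope check on those storage figures, assuming the ~16 GB covers both camera streams of one 80-minute class (an assumption; the paper's exact accounting may differ):

```python
# Back-of-the-envelope check of the reported storage numbers.
# Assumption: the ~16 GB figure covers both camera streams of one 80-minute class.
minutes, fps, cameras = 80, 15, 2
frames = minutes * 60 * fps * cameras            # 144,000 frames in total
video_bytes = 16 * 1024**3
print(f"{frames:,} frames -> ~{video_bytes / frames / 1024:.0f} KB per compressed 4K frame")

json_bytes = 250 * 1024**2                       # non-image JSON for one class
print(f"~{json_bytes / (minutes * 60) / 1024:.0f} KB of JSON per second of class")
```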


    3.7 Automated Scheduling & Classroom Processing Instances

    > scheduler: SOS JobScheduler (a different technical route from ours; we use apscheduler, an open-source scheduler for Python) (a minimal scheduling sketch follows)

    > FFMPEG instances: record the front and back camera streams (also different from our route; we use OpenCV)
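
    For comparison, here is a minimal sketch of the alternative route mentioned above: apscheduler driving a recording job at class times, with FFMPEG copying the stream to disk. The class schedule, stream URL, output path and duration are hypothetical placeholders, not values from either system.

```python
# Sketch: scheduling per-class recording with apscheduler (the alternative mentioned
# in the note above). Class times, stream URLs and output paths are placeholders.
import subprocess
from apscheduler.schedulers.blocking import BlockingScheduler

def record_class(rtsp_url: str, out_path: str, duration_s: int) -> None:
    # Copy the stream to disk without re-encoding; -t limits the recording length.
    subprocess.run([
        "ffmpeg", "-rtsp_transport", "tcp", "-i", rtsp_url,
        "-t", str(duration_s), "-c", "copy", out_path,
    ], check=True)

sched = BlockingScheduler()
# Hypothetical example: a Monday/Wednesday 9:00 class, recorded for 80 minutes.
sched.add_job(record_class, "cron", day_of_week="mon,wed", hour=9, minute=0,
              args=["rtsp://camera-13/stream1", "/data/class13.mp4", 80 * 60])

if __name__ == "__main__":
    sched.start()
```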


    3.8 High Temporal Resolution Infilling

    EduSense has two data-processing modes: a real-time mode (0.5 FPS) and an infilling mode (15 FPS video)

    > Real-time mode, as the name suggests, produces the various analytics while class is in session; current throughput is one frame every two seconds

    > Infilling mode is non-real-time analysis run during or after class; it provides high temporal resolution and complements the real-time pipeline. This more complete analysis can also feed later end-of-day reports or semester-long analytics


    3.9 Privacy Preservation

    Measures already taken: EduSense does not store classroom video as a matter of course; when infilling mode is needed, video is held in a temporary cache and deleted once analysis completes; access to classroom data is restricted by user role and permission to prevent leaks; individual students are tracked, but no personally identifying information is used, and the tracking IDs assigned in each class session are not linked across sessions; video temporarily retained for further development (testing, validation, and expanding the labeled dataset) is deleted promptly after use.

    Measures planned for the future: expose only high-level, aggregated classroom metrics (class aggregates).


    3.10 Debug and Development Interface

    QT5 GUI + RTSP/local filesystem + many widgets

    Fig. 7. Although EduSense is mostly launched as a headless process, we built a utilitarian graphical user interface for debugging and demonstration.

    3.11 Open Source and Community Involvement

    ● hope that others will deploy the system

    ● serve as a comprehensive springboard

    ● cultivate a community


    ※ Controlled Study

    4.1 Overall Procedure

    > five exemplary classrooms, 5 instructors and 25 student participants

    > Following a predefined "instruction sheet", participants performed the requested actions in sequence, while the debug system recorded the time, type, and image data of each action

    The predefined "instruction sheet" (figure)

    4.2 Body Keypointing

    > OpenPose is used for pose estimation, but it is not robust in classroom scenes, so the authors tuned some of its parameters and added pose-based sanity checks, improving stability and accuracy (similar in spirit to my own OpenPose improvements?)

    > The authors give no rigorous evaluation of the tuned OpenPose; they only report keypoint statistics on a small amount of data (is that a sound way to evaluate?)

    > As shown below, the authors also report detection accuracy for nine body keypoints; unsurprisingly, the upper body is detected more accurately than the lower body (but how much data these accuracies were computed on is not stated)

    Fig. 10. Histogram showing the percent of different body keypoints found in three of our experimental contexts.

    4.3 Phase A: Hand Raises & Upper Body Pose

    > The authors define seven upper-body pose classes: arms resting, left hand raised, left hand raised partial, right hand raised, right hand raised partial, arms closed, and hands on face

    > Student participants were asked to perform each of these poses three times during a class, for 21 instances in total

    > Instructor participants were asked to perform arms resting and arms closed three times each, at different positions in the classroom (left front, center front, right front), for 6 instances in total

    We only studied frames where participants’ upper bodies were captured (consisting of head, chest, shoulder, elbow, and wrist keypoints - without these eight keypoints, our hand raise classifier returns null).

    > The paper reports hand-raise detection accuracy as high as 94.6%, and accuracy for the other three upper-body poses as high as 98.6% (students) and 100% (instructors), but it does not state the sizes of the training and test sets, and these results all come from a purpose-built experimental setting; how convincing are they?


    4.4 Phase B: Mouth State

    > The authors define four mouth states: neutral (mouth closed), mouth open (teeth apart, as if talking), closed smile (no teeth showing), teeth smile (with teeth showing)

    > Student participants were asked to perform each state three times, for 12 instances in total;

    > Instructor participants were asked to perform each state three times, at different positions at the front of the classroom, for 12 instances in total

    > Based on the face landmark detection above, the authors built smile classification (accuracy 78.6% and 87.2%) and mouth-open classification (accuracy 83.6% and 82.1%). Still no mention of how much data was used

    > The authors concede that, because of limited resolution, landmarks can hardly be detected accurately on the faces of students in the back rows, and they optimistically assume higher-resolution cameras will solve this. (In our own tests, even 4K cameras still leave a low-resolution problem, and landmarks also suffer from large head angles and occlusion.)

    Fig. 11. The mouth states captured in our controlled study: mouth closed, closed smile, teeth smile, and mouth open.

    4.5 Phase C: Sit vs. Stand

    > Here the authors simply distinguish the two postures of standing and sitting.

    > Following the earlier protocol, student participants were asked to perform each of the two postures three times at random points in the test, for 6 instances per participant; instructors remained standing throughout and did not take part in this phase

    > Sit/stand classification accuracy was about 84.4% (the authors again do not say how large the test set was, but from the error rates in this section the total number of instances appears to be about 143)

    > Because classification relies only on 2D keypoint detections, the authors note that the method is strongly affected by camera viewpoint. (Of course it is; it is still less accurate and less robust than our direct standing detection.)

    > Finally, the authors suggest that depth data could improve this in the future. (I can only say that depth cameras are not necessarily helpful, and depth data is neither easy to collect nor easy to train on.)


    4.6 Phase D: Head Orientation

    > The authors define eight head orientations: three possible pitches ("down" -15°, "straight" 0°, "up" +15°) × three possible yaws ("left" -20°, "straight" 0°, "right" +20°), omitting directly straight ahead (i.e., 0°/0°). (Once again, a detection/estimation problem is turned into a classification problem.)

    > To elicit the corresponding head poses, the authors used smartphones running a pose-estimation app, plus printed instruction sheets taped to the desks; see the paper for the full procedure

    > Likewise, student participants were asked to perform each of the eight head orientations twice, producing 16 instances per person

    > "Unfortunately, in many frames we collected, ~20% of landmarks were occluded by the smartphones we gave participants - an experimental design error in hindsight." (Sure enough, this landmark-based head pose estimation is unreliable even in a controlled setting.)

    > "Which should be sufficient for coarse estimation of attention." (The authors discarded samples with poor landmark detection, leaving only about a quarter of the data; calling results measured on that remainder "sufficient" is a stretch, bordering on wishful thinking.)

    > The authors conclude that the main problem lies in landmark detection, and that once enough landmarks can be detected the head-orientation problem will be solved. (I remain skeptical of this technical route.)

    Fig. 12. Example head orientations requested in our study, with detected face landmarks shown.

    4.7 Phase E: Speech Procedure

    > This phase only detects whether anyone is speaking, teacher or student, without distinguishing between them

    > The protocol asked each of the 30 participants to speak once, yielding 30 five-second speech clips and 30 five-second non-speech clips, which were then classified. In the end, no-speech was recognized with 100% accuracy and speech with a single error, i.e., 98.3% accuracy

    My only comment is that this speech metric and pipeline are far too simple, and the amount of test data far too small, to be convincing


    4.8 Face Landmarks Results

    > Face landmark detection uses published algorithms directly, e.g., [4][13][44]; my guess is that [13] (CMU's OpenPose) is most likely the one used

    > Again under the controlled setting, this part reports face landmark detection accuracies that are not very convincing

    > "poor registration of landmarks was due to limited resolution" (the low-resolution problem again)


    4.9 Classroom Position & Sensing Accuracy vs. Distance

    > We manually recorded the distance of all participants from the camera using a surveyors’ rope

    > Computer-vision-driven modules are sensitive to image resolution and vary in accuracy as a function of distance from the camera.

    > One question here: won't the instructor and student detections overlap? In other words, won't each appear in the other's camera view? If they do, the paper does not address how to tell the two apart.

    Distance estimation (figure)

    4.10 Framerate and Latency

    > At this stage, only processing of saved video is considered; the real-time system is not measured here

    > Unsurprisingly, the two foundational mapping stages, body keypointing and face landmarking, account for most of the processing time. In particular, face landmarking time grows with the number of people in the image. (A question here: pose estimation uses the bottom-up OpenPose algorithm, so its runtime should not simply grow linearly with the number of people; yet in Figure 15 the runtime does not increase at all as the head count goes from 0 to 54, which cannot be right. In my own tests, OpenPose's joint-grouping step also takes some CPU time. The claim that OpenPose itself needs only tens of milliseconds is also hard to believe; even at half of 1K resolution, an image takes about a second.)

    > The runtimes of the other processing stages look unremarkable

    Fig. 15. Runtime performance of EduSense’s various processing stages at different loads (i.e., number of students).

    ※ Real-world Classrooms Study

    5.1 Deployment and Procedure

    > We deployed EduSense in 13 classrooms at our institution and recruited 22 courses for an "in-the-wild" evaluation (with a total student enrollment of 687).

    > 360.8 hours of classroom data

    438,331 student-facing frames and 733,517 instructor-facing frames were processed live, with a further 18.3M frames infilled after class to bring the entire corpus up to a 15 FPS temporal resolution.

    > We randomly pulled 100 student-view frames (containing 1797 student body instances) and 300 instructor-view frames (containing 291 instructor body instances; i.e., nine frames did not contain instructors) from our corpus.

    > "This subset is sufficiently large and diverse" (I beg to differ...)

    > To provide the ground truth labels, we hired two human coders, who were not involved in the project. (Compared with our own annotation effort, EduSense's labeling workload here is rather thin)

    > It was not possible to accurately label head orientation and classroom position (many metrics are only rough estimates; if position were expressed in rows and columns, as in our system, it could be measured and evaluated more precisely)


    5.2 Body Keypointing Results

    > EduSense found 92.2% of student bodies and 99.6% of instructor bodies. (Even in real classrooms, the evaluation is still confined to a small amount of data and lacks persuasiveness)

    > 59.0% of student and 21.0% of instructor body instances were found to have at least one visible keypoint misalignment (so the real-world quality is not necessarily good)

    > "We were surprised that our real-world results were comparable to our controlled study, despite operating in seemingly much more challenging scenes" (the authors' explanation: compared with the deliberately varied poses and head orientations of the controlled study, real scenes, though more chaotic, mostly show students facing forward and leaning on their desks, which is easier to recognize)


    5.3 Face Landmarking Results

    > Again, face detection accuracy and the corresponding landmark accuracy are reported separately for students and instructors on a partial dataset (still no results on a large-scale labeled dataset)

    > The authors note that although real scenes are more complex, the face detection results remain quite robust (that is to the credit of the off-the-shelf algorithm; what is the point of claiming it here?)


    5.4 Hand Raise Detection & Upper Body Pose Classification

    > "Hand raises in our real-world dataset were exceedingly rare" (no surprise: with only the 22 courses above as test material, and in a university classroom setting, hand-raise examples were bound to be scarce)

    > Out of our 1797 student body instances, only 6 body instances had a hand raised (less than 0.3% of all body instances). Of those six hand-raise instances, EduSense correctly labeled three, incorrectly labeled three, and missed zero, for an overall true-positive accuracy of 50.0%. There were also 58 false-positive hand-raise instances (3.8% of all body instances). (The hand-raise results are dismal.)

    > The measured results for the other poses are also unimpressive, with the same problems of little data and limited persuasiveness


    5.5 Mouth Smile and Open Detection

    > "Only 17.1% of student body instances had the requisite mouth landmarks present for EduSense's smile detector to execute." (even less usable data) -- (Student) smile vs. no-smile classification accuracy was 77.1%

    > Only 21.0% of instructor body instances had the required facial landmarks (again far less test data) -- (Instructor) smile vs. no-smile classification accuracy was 72.6%

    > For mouth open/closed detection, accuracy was stronger: 96.5% (students) and 82.3% (instructors). (Note that the vast majority of instances, about 94.8%, were mouth-closed.) (The authors' explanation: compared with smiling, an open mouth is harder to perceive)

    > Finally, the authors again raise the resolution issue: open/closed mouth detection depends strongly on the resolution of the mouth, and annotators' judgments of an open mouth are also somewhat subjective. So this metric is only preliminary


    5.6 Sit vs. Stand Classification

    > "We found that a vast majority of student lower bodies were occluded, which did not permit our classifier to produce a sit/stand classification, and thus we omit these results" (so the real-world evaluation does not include student sit/stand classification)

    > Even for instructors, lower-body keypoints were detected in only 66.3% of frames; within those, sitting and standing were recognized with 90.5% and 95.2% accuracy respectively (with so little data, how credible is this?)


    5.7 Speech/Silence & Student/Instructor Detection

    > For speech/silence classification, the authors selected 50 five-second clips with speech and 50 five-second clips without, and measured 82% accuracy

    > For student/instructor detection, they selected 25 ten-second clips of instructor speech and 25 ten-second clips of student speech; the speaker could be distinguished with only 60% accuracy (as expected, close to the 50% level of random guessing)

    > The authors believe speaker detection at this stage is heavily affected by classroom layout and microphone placement, and that only two audio capture devices is simply not enough; solving this would require a more sophisticated approach: speaker identification


    5.8 Framerate & Latency

    > See Figure 15 for the detailed runtime breakdown

    > "We achieve a mean student view processing framerate of between 0.3 and 2.0 FPS." (Is offline video really processed that fast at this stage?) The instructor stream is 2-3 times faster

    > According to the latency analysis, the real-time system's end-to-end delay is 3-5 seconds, with the contributions, from largest to smallest: IP cameras > backend processing > storing results > transmission (wired network)

    > The authors expect that future, higher-end IP cameras will reduce latency and enable large-scale use of the real-time system (5G plus high-end embedded camera processing chips?)


    ※ End-user Applications

    > Our future goal with EduSense is to power a suite of end-user, data-driven applications.

    > How to design the front-end presentation is itself a careful choice; the authors suggest several possibilities:

        ● tracking the elapsed time of continuous speech, to help instructors inject lectures with pauses, as well as opportunities for student questions and discussion (instructor speech detection + timing?)

        ● automatically generated suggestions to increase movement at the front of the class (instructor trajectory?)

        ● and to modify the ratio of facing the board vs. facing the class (ratio of instructor orientations?)

        ● a cumulative heatmap of all student hand raises so far in the lecture, which could facilitate selecting a student who has yet to contribute (hand-raise heatmap?)

        ● a histogram of the instructor's gaze could highlight areas of the classroom receiving less visual attention (instructor gaze tracking + statistics?)

    > Beyond real-time in-class feedback, after-class and end-of-semester summary reports are also important (rendered as PDFs and emailed to specific recipients)

    > The authors then reiterate the potential benefits of EduSense detecting instructor metrics and offering real-time suggestions (including gaze direction [65], gesticulation through hand movement [81], smiling [65], and moving around the classroom [55][70])

    > A web-based data visualizer (Figure 16): Node.js + ECharts + React (front-end stack)

    Fig. 16. Preliminary classroom data visualization app. (Visualization Dashboard)

    ※ Discussion

    > "Taken together, our controlled and real classroom studies offer the first comprehensive evaluation of a holistic audio- and computer-vision-driven classroom sensing system, offering new insights into the feasibility of automated class analytics" (a long sentence and a bold claim)

    > Based on the controlled and real-world studies, the authors offer some deployment advice: for example, avoid classrooms that are too large (no more than roughly 8 m front to back), and mount cameras where they give a good view of the room

    > The authors point out that algorithmic errors propagate through the system; the paper has characterized the upper and lower bounds of each module, stage by stage

    > They go on to note that much work remains; maturing the system will require continued help from the broader research community, and more contact and communication with end users at universities and high schools

    > "We also envision EduSense as a stepping stone towards the furthering of a university culture that values professional development for teaching" (a fine vision, and one we also hold for the system we are building)


    ※ Conclusion

    1. We have presented our work on EduSense, a comprehensive classroom sensing system that produces a wide variety of theoretically-motivated features, using a distributed array of commodity cameras. (contribution)

    2. We deployed and tested our system in a controlled study, as well as real classrooms, quantifying the accuracy of key system features in both settings. (analysis)

    3. We believe EduSense is an important step towards the vision of automated classroom analytics, which holds the promise of offering a fidelity, scale and temporal resolution that are impractical with the current practice of in-class observers. (vision)

    4. To further our goal of an extensible platform for classroom sensing that others can also build on, EduSense is open sourced and available to the community. (call to action)

    3. Novel Points

    Without question, EduSense is a remarkable intelligent classroom sensing system. Although throughout this analysis I have deliberately compared it with our own technical route and questioned many of its technical details, that does not change the fact that it is the first complete system to apply modern AI techniques to classroom video analysis across the board. Compared with the various earlier systems of its kind, the following points secure EduSense's position:

    1) It genuinely brings GPU-powered AI algorithms into visual understanding of classroom imagery, rather than the earlier pseudo-AI approaches (relying entirely on hardware sensing, using traditional machine learning methods with poor generalization and robustness, or even falling back on manual tallies);

    2) It analyzes students' in-class state from as many dimensions and modalities as possible, including behavior, speech, and facial expression, instead of studying and sensing the classroom one-sidedly through a single isolated feature, as earlier systems did;

    3) The paper's structure is worth praising and emulating. Unlike traditional CV-algorithm papers, UbiComp seems to emphasize describing a complete, executable system, asking authors to present the proposed approach from multiple angles: framework design, algorithmic detail, engineering deployment, and application cases. This may not count as novelty in itself, but it was my strongest first impression on encountering UbiComp papers, and this style of writing fits systems-engineering work well.

    4. Summary

    To be frank, EduSense's technical choices are not always optimal, and many of its technical details contain holes or contradictions; but these flaws do not outweigh its merits, and the review committee published it anyway; being the first to take the plunge is rare and valuable. Since then, my attempts to publish our own work (variously called AIClass, EduMaster, and StuArt) have not succeeded. Beyond the inevitable reviewer comparisons with EduSense and the verdict that the difference was not significant (in fact many technical details differ, but no reviewer was willing to dig into them), there were also concerns about student privacy. That is partly because we took privacy and security too lightly, but the bigger variable may be the ever-stricter climate around preventing privacy leaks, in both China and the West. So we may no longer agonize over publishing an EduSense 2.0 at UbiComp; but as long as we keep digging into intelligent classroom analysis and evaluation, we are bound to produce some differentiated results, and it will not be too late to approach UbiComp again then.
