SETI Breakthrough Listen - E.T. Signal Search
(Some material and images are adapted from "深度之眼".)

“Are we alone in the Universe?”
It’s one of the most profound—and perennial—human questions. As technology improves, we’re finding new and more powerful ways to seek answers. The Breakthrough Listen team at the University of California, Berkeley, employs the world’s most powerful telescopes to scan millions of stars for signs of technology. Now it wants the Kaggle community to help interpret the signals they pick up.
The Listen team is part of the Search for ExtraTerrestrial Intelligence (SETI) and uses the largest steerable dish on the planet, the 100-meter diameter Green Bank Telescope. Like any SETI search, the motivation to communicate is also the major challenge. Humans have built enormous numbers of radio devices. It’s hard to search for a faint needle of alien transmission in the huge haystack of detections from modern technology.
Current methods use two filters to search through the haystack. First, the Listen team intersperses scans of the target stars with scans of other regions of sky. Any signal that appears in both sets of scans probably isn’t coming from the direction of the target star. Second, the pipeline discards signals that don’t change their frequency, because this means that they are probably nearby the telescope. A source in motion should have a signal that suggests movement, similar to the change in pitch of a passing fire truck siren. These two filters are quite effective, but we know they can be improved. The pipeline undoubtedly misses interesting signals, particularly those with complex time or frequency structure, and those in regions of the spectrum with lots of interference.
In this competition, use your data science skills to help identify anomalous signals in scans of Breakthrough Listen targets. Because there are no confirmed examples of alien signals to use to train machine learning algorithms, the team included some simulated signals (that they call “needles”) in the haystack of data from the telescope. They have identified some of the hidden needles so that you can train your model to find more. The data consist of two-dimensional arrays, so there may be approaches from computer vision that are promising, as well as digital signal processing, anomaly detection, and more. The algorithm that’s successful at identifying the most needles will win a cash prize, but also has the potential to help answer one of the biggest questions in science.
Data Analysis
First, the distribution of labels across the whole training set:
[Figure: distribution of TARGET labels in the training set]
As the figure above shows, this is a highly imbalanced dataset.
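As a quick check, the label balance can be inspected directly from the labels file. This is a minimal sketch: the file path and the df_train name are assumptions chosen to match the training code later in the post.

import pandas as pd

# Hypothetical path to the competition's labels file
df_train = pd.read_csv("../input/train_labels.csv")
# Fraction of positive (TARGET=1) vs. negative (TARGET=0) examples
print(df_train["target"].value_counts(normalize=True))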
A cadence spectrogram containing an injected "needle" (TARGET=1) looks like this:
[Figure: cadence spectrogram with an injected needle, TARGET=1]
A spectrogram without a needle (TARGET=0) looks like this:
[Figure: cadence spectrogram without a needle, TARGET=0]
In the two examples above, the difference between TARGET=0 and TARGET=1 is roughly visible even to the naked eye.
The EfficientNet Model
Our model uses EfficientNet, arguably the current state-of-the-art image classification architecture. Paper: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, https://arxiv.org/abs/1905.11946

EfficientNet was developed by engineers at Google Brain. Its base network architecture was designed using neural architecture search (NAS), so it is worth briefly explaining NAS first.

For the NAS task, the most intuitive approach is to repeatedly sample different neural architectures from the search space and actually train each one to measure its true performance. By repeating this process until the entire search space is exhausted, we would naturally end up with the best architecture in that space.
Following this same simple idea, early convolutional networks (ConvNets) were typically developed under a fixed resource budget and then scaled up for better accuracy when more resources became available, for example by increasing network depth, network width, or input image resolution. Manually tuning how to scale depth, width, and resolution up or down is difficult, however: under a limited compute budget it is hard to decide which dimension to enlarge and which to shrink. In other words, the combination space is too large to enumerate by hand.
Against this background, the paper proposes a new model scaling method: instead of scaling network dimensions arbitrarily as traditional methods do, it uses a simple yet effective compound coefficient to scale the network along the three dimensions of depth, width, and resolution, with the optimal set of coefficients obtained through neural architecture search. As the comparison figure in the paper shows, EfficientNet is not only much faster than other networks but also more accurate.
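Concretely, the paper's compound scaling rule ties the three dimensions to a single user-chosen coefficient φ, with constants α, β, γ found by a small grid search on the base network:

depth: d = α^φ,  width: w = β^φ,  resolution: r = γ^φ
subject to α · β² · γ² ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1

Since FLOPS scale roughly with d · w² · r², the constraint means that increasing φ by one approximately doubles the total compute.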
I use efficientnet_b0, which can be imported directly from the timm package (note that the training code below instantiates 'efficientnet-b3' through the efficientnet_pytorch package instead):
https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/efficientnet.py
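A minimal sketch of loading it through timm; create_model's num_classes and in_chans arguments are standard timm parameters, and in_chans adapts the stem to the 6-channel cadence input.

import timm
import torch

# EfficientNet-B0 backbone with a single-logit head and a 6-channel stem
model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=1, in_chans=6)
out = model(torch.randn(2, 6, 224, 224))  # output shape: (2, 1)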
Visualized with TensorBoard, the model graph looks roughly like this:
[Figure: TensorBoard graph of the efficientnet_b0 model]
Baseline Code Walkthrough
Importing packages
# Libraries
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import codecs
import os
import glob
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import cv2
from tqdm import tqdm
from colorama import Fore, Back, Style
r_ = Fore.WHITE
from plotly.offline import iplot
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from skimage.io import imshow, imread, imsave
from skimage.transform import rotate, AffineTransform, warp, rescale, resize, downscale_local_mean
from skimage import color, data
from skimage.exposure import adjust_gamma
from skimage.util import random_noise
# Imports needed by the model and training code below
import torch
import torch.nn as nn
import efficientnet_pytorch as enet  # provides EfficientNet.from_pretrained
from sklearn.model_selection import StratifiedKFold
Data Loading
We first define helper functions for locating and visualizing the data files, so they are easy to read later:
def get_train_filename_by_id(_id: str) -> str:
    # Files are sharded into subfolders by the first character of the id
    return f"../input/train/{_id[0]}/{_id}.npy"

def show_cadence(filename: str, label: int) -> None:
    # Each .npy file holds a cadence of 6 spectrograms, alternating ON/OFF target scans
    fig, axes = plt.subplots(6, 1, figsize=(16, 10))
    ax = axes.ravel()
    arr = np.load(filename)
    for i in range(6):
        ax[i].imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
        ax[i].text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
        if i != 5:
            ax[i].set_xticks([])
    fig.text(0.5, -0.02, 'Frequency Range', ha='center', fontsize=18)
    fig.text(-0.02, 0.5, 'Seconds', va='center', rotation='vertical', fontsize=18)
    plt.suptitle(f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
    fig.tight_layout()
    plt.show()
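For example, assuming the hypothetical df_train DataFrame loaded earlier (with id and target columns), we can visualize one cadence and also derive the img_path column that the training code below expects:

# Derive file paths from ids (img_path is used by the training loop below)
df_train["img_path"] = df_train["id"].apply(get_train_filename_by_id)
# Visualize the first training example
show_cadence(df_train["img_path"].iloc[0], df_train["target"].iloc[0])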
Below we define a custom dataset object:
class ClassificationDataset:
    def __init__(self, image_paths, targets):
        self.image_paths = image_paths
        self.targets = targets

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, item):
        image = np.load(self.image_paths[item]).astype(float)
        targets = self.targets[item]
        # Normalize each of the 6 cadence channels by its own absolute maximum
        image = image / np.array([np.abs(image[i]).max() for i in range(6)]).reshape(6, 1, 1)
        return {
            "image": torch.tensor(image, dtype=torch.float),
            "targets": torch.tensor(targets, dtype=torch.long),
        }
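A quick sanity check of the dataset object, under the same df_train assumption:

# Fetch one sample; "image" is a (6, H, W) float tensor, one channel per scan
dataset = ClassificationDataset(
    image_paths=df_train["img_path"].values,
    targets=df_train["target"].values,
)
sample = dataset[0]
print(sample["image"].shape, sample["targets"])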
Classification Model Definition
Define an EfficientNet-based classifier:
class enetv2(nn.Module):
    def __init__(self, backbone, out_dim):
        super(enetv2, self).__init__()
        self.enet = enet.EfficientNet.from_pretrained(backbone)
        # self.enet.load_state_dict(torch.load(pretrained_model[backbone]))
        # Replace the classification head: keep the backbone as a feature
        # extractor and attach our own fully connected layer
        self.myfc = nn.Linear(self.enet._fc.in_features, out_dim)
        self.enet._fc = nn.Identity()
        # Map the 6 cadence channels down to the 3 channels EfficientNet expects
        self.conv1 = nn.Conv2d(6, 3, kernel_size=3, stride=1, padding=3, bias=False)

    def extract(self, x):
        return self.enet(x)

    def forward(self, x):
        x = self.conv1(x)
        x = self.extract(x)
        x = self.myfc(x)
        return x
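A minimal smoke test of the forward pass (from_pretrained downloads ImageNet weights; the spatial size here is arbitrary, since EfficientNet pools adaptively):

model = enetv2('efficientnet-b0', out_dim=1)
x = torch.randn(2, 6, 224, 224)  # dummy batch of two 6-channel cadences
print(model(x).shape)  # torch.Size([2, 1]): one logit per example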
Model Training
baseline_name = 'efficientnet-b3'
models = []
device = "cuda"
epochs = 4
Batch_Size = 32
X = df_train.img_path.values
Y = df_train.target.values
# shuffle=True is required for random_state to take effect in StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1024)
fold = 0
for train_index, test_index in skf.split(X, Y):
    model = enetv2(baseline_name, out_dim=1)
    model.to(device)
    model = nn.DataParallel(model)
    train_images, valid_images = X[train_index], X[test_index]
    train_targets, valid_targets = Y[train_index], Y[test_index]
    train_dataset = ClassificationDataset(image_paths=train_images, targets=train_targets)
    valid_dataset = ClassificationDataset(image_paths=valid_images, targets=valid_targets)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=Batch_Size, shuffle=True, num_workers=12)
    valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=Batch_Size, shuffle=False, num_workers=12)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    # Pass the epochs variable rather than a hard-coded count
    train(train_loader, valid_loader, model, optimizer, device, fold, epochs)
    models.append(model)
    fold += 1
    print('')
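The train() function itself is not shown in this excerpt. Below is a minimal sketch of what such a loop might look like; the BCEWithLogitsLoss choice and the exact signature are assumptions matching the single-logit output, not the author's original code.

def train(train_loader, valid_loader, model, optimizer, device, fold, epochs):
    # Hypothetical implementation: the original post does not include train()
    criterion = nn.BCEWithLogitsLoss()
    for epoch in range(epochs):
        model.train()
        for batch in tqdm(train_loader):
            images = batch["image"].to(device)
            # Cast integer labels to float to match the single-logit output
            targets = batch["targets"].float().unsqueeze(1).to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in valid_loader:
                images = batch["image"].to(device)
                targets = batch["targets"].float().unsqueeze(1).to(device)
                val_loss += criterion(model(images), targets).item()
        print(f"fold {fold} epoch {epoch}: val_loss = {val_loss / len(valid_loader):.4f}")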
Summary
This article aims to help newcomers get started with image classification models. We mainly covered EfficientNet, a current state-of-the-art image classification model, along with the optimizer, cross-validation, and more; the complete code also includes MixUp.