Data Engineer

San Francisco, CA /
Engineering Team /
Data Engineering
Deep Discovery is hiring a Data Engineer to build the infrastructure for acquiring, ingesting, processing, indexing and retrieving the data streams we consume, and to build the platforms that enable the training, tuning, deployment and monitoring of the statistical models behind the background checks our product produces for banks. This kind of product is called a Know Your Customer (KYC) system. Banks use KYC systems to evaluate the risk of doing business with their clients so that they avoid stiff fines from global regulatory agencies, which amount to as much as $10 billion each year.

Adaptations of this KYC system will also be developed into other products for government regulatory agencies, investor due diligence, supply chain risk assessment and, most importantly in support of our social mission, leading investigative journalists and anti-corruption NGOs around the world.

We are taking a network-centric approach to KYC that evaluates clients in terms of the context in which they do business. This involves building and operating several systems to perform data engineering tasks:

    • A data collection framework for frequently updating web crawls and APIs
    • A workflow system and batch scheduler to run jobs both large and small
    • A system to manage networks of reliable workers to process real-time streams of data, operations and commands
    • An automation system for machine learning operations to enable feature extraction and selection; model training, selection and tuning; and model deployment and monitoring
    • API services to host models and evaluate them on data on the fly
    • A build system for creating a single, unique and verifiable build of the end-to-end product, with all of its systems including its user interfaces, from a set of source repositories
    • A catalog of custom docker images for performing each task
    • Various databases we use to serve the system and its assets: graph, relational, search
    • A rigorous system of quality assurance testing using the above triggered by continuous integration directly from source code
    • Dashboards through which to use, control and understand the above

The ideal candidate has 5+ years of experience in data engineering or machine learning operations, is experienced with graph databases, has early-stage startup experience and is excited by the opportunity to define and build the data and machine learning systems driving a mission-critical application for the finance industry. Strong Python skills are essential. Search experience is required. Finance experience is a plus.

To help you see whether we mesh well, here are some technologies from our toolkit: Dagster, Airflow, Kubeflow, Kubernetes, Superset, Luigi, Kafka, Vectorized, Elasticsearch, Faiss, Spark, YARN, PySpark, Dask, Ubuntu, Debian