Intern Software Engineer - Machine Learning Team

Madrid, Spain
Engineering
Intern
Location: Madrid
Seniority: Intern
Starting date: 1st of July 2018
Duration: Discussed on an individual basis
Compensation: 1,200 Euros/month + accommodation

At source{d}, we are building the technology stack for the next generation of Machine Learning-powered developer tools. We are an open-core company built around our Open Source projects. We have raised over eight million USD so far, and we are currently growing our team.

Engineering at source{d}

The Machine Learning team is focused on providing intelligent insights into nearly all the world's Open Source code. This includes typical Big Data analysis as well as solving sophisticated Machine Learning problems. The former is conducted on Apache Spark clusters; the latter leverages extensive GPGPU acceleration on custom hardware using deep learning techniques. The ML problems we face are often unsupervised and require novel research in the domain of ML on source code (MLonCode).

• Our tools and work are always open-sourced to benefit the community
• Publishing papers and blog posts is encouraged and part of our workflow
• We enjoy releasing datasets
• We regularly speak at conferences, but it is not required

source{d}'s ML engineers have released:
• src-d/awesome-machine-learning-on-source-code - everything that we know about MLonCode.
• ml - MLonCode framework.
• apollo - modular source code de-duplication research project.
• kmcuda - lightning fast K-means and K-nearest neighbours on NVIDIA GPUs. It allows us to cluster 3 million samples, each with 256 dimensions, into 1000 clusters in less than 20 minutes with outstanding precision using two NVIDIA Titan X cards.
• minhashcuda - lightning fast Weighted MinHash on NVIDIA GPUs. This tool allowed us to scan all 17M+ Git repositories in 30 minutes and find 1.5M duplicates (the results were published on data.world).
• lapjv - Jonker-Volgenant algorithm to solve linear sum assignment problems, accelerated with the Intel AVX2 instruction set. It is capable of transforming 4096 t-SNE dots into 64x64 images in 2 minutes.
• wmd-relax - optimized Word Mover's Distance.
• hercules - command-line tool for line burndown analysis of Git repositories, built on top of source{d}'s go-git, a Git client and server implementation in pure Go.
• sparkpickle - a tool to read PySpark RDD files without having to install Spark.

Notable technical posts, papers and talks from the team:
• Source Code Identifier Embeddings
• Open Source Stack for Machine Learning on Source Code
• Analyzing GitHub, How Developers Change Programming Languages Over Time
• GitHub Contributions Graph: Analyzing Pagerank & Proving the 6 Handshakes Theory
• Similarity of GitHub Repositories by Source Code Identifiers
• 397 Languages, 18,000,000 GitHub repositories, 1.2 billion files, 20 terabytes of code: Spaces or Tabs
• Topic modeling of public repositories at scale using names in source code - paper
• Hands on with the most starred GitHub repositories
• Source code abstracts classification using CNN

The following datasets have been published:
• Names in source code extracted from 13 000 000 GitHub repositories. Not people! 30GB
• October 2016 GitHub repositories not marked as forks but very similar to each other
• Readme files found in all GitHub repositories (16M, October 2016)
• ≈ 452,000,000 commits' metadata taken from 16,000,000 repositories on GitHub (Oct 2016)

Role

• This internship targets students and junior engineers who are interested in the field of Machine Learning on Source Code.
• A strong computer science and Machine Learning background is essential for Machine Learning team members.
• You will be expected to be a passionate, skilful engineer who is able to produce amazing results quickly and reliably.
• Coding skills are important; we use Python 3 in our research and production prototyping. We also occasionally code in Go, C++ and CUDA, but prior knowledge of these languages is not required for the internship.
• Machine learning on source code shares common ideas with Natural Language Processing, so NLP knowledge is considered a plus.
• Any deep learning experience is highly appreciated.

Culture

• source{d} is a company for developers by developers. We firmly believe in always doing what's best for the individual developer in the community. Our team consists of members who are passionate about programming. To understand our culture better, read more about it here.

• At the moment, we are 25+ people from 10 different countries working as a distributed organization. Some of our team members are based in the Madrid or San Francisco office, others work remotely from around the world (Portugal, Estonia, Russia, and others).

• For those wanting to work from one of our offices, we fully support the visa and moving process for you and your family.

• At source{d}, we have a transparent salary policy that we feel strongly about. Your seniority level will be determined during the last round of on-site interviews.

• At source{d}, all of the projects we work on are public on GitHub, and the vast majority are open-source under licenses such as Apache 2.0 or GPLv3.

• We don't just believe in open-source; we also believe in radical transparency as an organization. That is why we publish everything about the company at github.com/src-d/guide.

Perks

• We go to conferences and other developer events!
• Open Source Days: every second Monday, you are encouraged to work on any OSS project you choose.
• Flexible hours: set a schedule that fits you.
• Free books. We will buy any books that help you learn & grow.
• If you choose to work from one of our offices, you will enjoy a comfortable and spacious environment.
• Annual summer and winter (Christmas) parties and a hackathon retreat are held in Madrid, and all team members are flown in for them.
• We also have our own Open Source craft beers.