After my Goldman Sachs internship, I interned at Lablup for seven weeks to make use of the remainder of the summer break.
Internship Application Process
Lablup was a mentoring company in Contributhon, a program run by the Open Source Software Portal in Korea. It was a six-week program where participants were matched to a company and contributed to its open-source software. I wanted to work full-time instead, so I contacted the founder/CEO of the company directly, did an interview, and started working.
About Lablup
Lablup is one of the hottest technology startups in Korea right now. It develops and runs Backend.AI, which is a resource management platform for AI research.
Google Colab has limitations because it is not meant for sophisticated use cases, like installing programs that run on certain operating systems or downloading large datasets. As a consumer-facing service, Google Colab is not used by organizations doing serious AI research. When I worked at Onclusive, I had to start an AWS instance every time I needed GPUs for model training. It was tedious: I had to run scripts to download the required programs each time, transfer files with scp, and kill the instance when I was done.
Lablup has been earning profits mostly by providing Backend.AI infrastructure support to companies. When I started working, Lablup had just launched a closed beta of its consumer-facing service, essentially a better, paid version of Google Colab.
Internship Experience
I worked on small tasks throughout the internship because seven weeks were not enough to take on a large project, especially given the complexity of the Backend.AI system. It takes a while to set the right configurations and run the full system on a local machine; to run Backend.AI with its UI, you need to run six programs at once. It took me four to five weeks to understand the high-level architecture and how the various components interact. The following are the tasks I worked on.
Even Allocation of GPU
Some research assumes that multiple GPUs use the same amount of resources, so we sometimes need to allocate GPU shares evenly instead of using a simple greedy approach. For example, if a user requests 0.3 GPUs and we have two GPUs with 0.2 capacity available each, we want to allocate 0.15 to each GPU instead of 0.2 and 0.1.
I spent most of my time developing the algorithm. It was similar to solving a Leetcode problem, yet there were many edge cases, so writing simple, intuitive code was difficult. I then integrated the algorithm into the system.
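A minimal sketch of the idea (my own illustration, not the actual Backend.AI scheduler code): distribute the requested fraction across devices as evenly as possible, letting capacity-constrained devices take what they can and spreading the shortfall over the rest.

```python
def allocate_evenly(request: float, capacities: dict[str, float]) -> dict[str, float]:
    """Split `request` across devices as evenly as their capacities allow."""
    alloc = {}
    remaining = request
    # Visit devices from smallest available capacity to largest
    # ("water-filling"): a device that cannot take its even share takes
    # its full capacity, and the shortfall is redistributed among the rest.
    devices = sorted(capacities.items(), key=lambda kv: kv[1])
    for i, (dev, cap) in enumerate(devices):
        share = remaining / (len(devices) - i)
        alloc[dev] = min(share, cap)
        remaining -= alloc[dev]
    if remaining > 1e-9:
        raise ValueError("not enough total GPU capacity for the request")
    return alloc

allocate_evenly(0.3, {"gpu0": 0.2, "gpu1": 0.2})  # → 0.15 on each GPU
```

The sort makes the edge cases tractable: once the tightest device is handled, the problem recurses on the remaining devices with an updated even share.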
Replacement of etcd3
Backend.AI uses asyncio extensively. python-etcd3 is a library that provides asyncio support for etcd, yet it has not been maintained for a while, and there was a grpc-related error when shutting down the Backend.AI agent. While looking for a replacement, I found a library called aetcd3, a fork of etcd3aio, which is itself a fork of python-etcd3. I updated some parts of aetcd3 and made it work in my local environment. However, since swapping out an open-source dependency is a big deal for system stability, my work has not been merged yet. Working on this issue gave me a taste of what the open-source community is like.
Github Link - Issue
XFS storage proxy
XFS is a file system created by Silicon Graphics, and I implemented a storage proxy for it. Backend.AI uses a system of virtual folders: cloud-like private file storage where individual users can create virtual folders to store files, libraries, and code. Folder sharing and various permission levels are supported, of course. Under the hood, virtual folders are hosted on a file system designated by the program.
The base virtual folder functionalities had already been implemented, and I extended them to XFS. XFS keeps a registry of projects in /etc/projid and /etc/projects, so I updated these files when creating or deleting a virtual folder. Since each XFS project has a quota, I also updated the quota upon file creation or virtual folder cloning.
This project took the most time because I had to understand how XFS works and learn its basic commands. Since macOS does not support XFS, I worked on an AWS Ubuntu machine. Learning tmux and vim and working solely in the terminal, without any visual elements, was fun.
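To illustrate the bookkeeping involved: the entry formats below follow standard XFS conventions, but the helper names, paths, and quota values are my own sketch, not the actual storage proxy code.

```python
def projid_entry(name: str, proj_id: int) -> str:
    # /etc/projid maps a project name to its numeric project ID
    return f"{name}:{proj_id}"

def projects_entry(proj_id: int, path: str) -> str:
    # /etc/projects maps a project ID to the directory it covers
    return f"{proj_id}:{path}"

def quota_commands(name: str, mountpoint: str, hard_limit: str) -> list[list[str]]:
    # xfs_quota in expert mode (-x): "project -s" initializes the project,
    # "limit -p" sets a hard block quota for it
    return [
        ["xfs_quota", "-x", "-c", f"project -s {name}", mountpoint],
        ["xfs_quota", "-x", "-c", f"limit -p bhard={hard_limit} {name}", mountpoint],
    ]

# Creating a virtual folder would append these entries and run the commands:
projid_entry("vf-demo", 1001)            # "vf-demo:1001"
projects_entry(1001, "/vfroot/vf-demo")  # "1001:/vfroot/vf-demo"
```

Deleting a virtual folder is the reverse: remove the two entries and clear the project quota.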
Github Link - PR
Virtual Folder Leave
I implemented the functionality for users to leave a shared virtual folder. When a user leaves, the relevant database records are updated and she loses access to the folder. I also wrote the corresponding client support so that users can take this action from the Python client. This task felt like standard backend work: building an API endpoint and its related features.
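Conceptually, leaving boils down to revoking a permission row. A toy sketch with SQLite (the table and column names are hypothetical, not Backend.AI's actual Postgres schema):

```python
import sqlite3

# Toy stand-in for the sharing-permission table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vfolder_permissions (vfolder_id TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO vfolder_permissions VALUES (?, ?)",
    [("vf-1", "alice"), ("vf-1", "bob")],
)

def leave_vfolder(conn, vfolder_id: str, user_id: str) -> int:
    # Leaving a shared folder revokes the user's permission row; later
    # access checks against this table then fail for that user.
    cur = conn.execute(
        "DELETE FROM vfolder_permissions WHERE vfolder_id = ? AND user_id = ?",
        (vfolder_id, user_id),
    )
    conn.commit()
    return cur.rowcount  # number of permission rows revoked

leave_vfolder(conn, "vf-1", "alice")  # alice can no longer access vf-1
```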
Github Link - Issue, PR (manager), PR (client-py)
Virtual Folder Clone
A user may want to clone a shared or group virtual folder into her private virtual folder, either to add experimental data or to use it as a starting point for a different project. I implemented the relevant functionality in the Backend.AI agent and manager as well as the Python client. It was similar to the Virtual Folder Leave task.
Github Link - Issue, PR (manager), PR (client-py)
Install Script Update
I updated the script that installs all the components required to run Backend.AI, including the storage proxy. The goal was to enable a single-run installation. The script also fills in the initial configurations for etcd and Postgres. It is written in shell script.
Github Link - Issue1, Issue2, PR
Global Dotfiles
Backend.AI allows each user to edit his or her environment config files, like ~/.bashrc or ~/.zshrc. Administrators should also be able to edit global shell profiles or config files, like /etc/bash.bashrc or /etc/profile, at the domain or group level.
I added a field to the Postgres tables to store dotfiles, along with the corresponding functionality to create, update, and delete them in the Backend.AI manager, agent, and client. I also added code to copy the dotfiles into their target locations upon starting a new kernel.
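The copy-on-kernel-start step can be sketched as follows; the (path, content) pair format and the demo paths are assumptions for illustration, since the real manager stores dotfiles in Postgres and the agent writes them out inside the container.

```python
import tempfile
from pathlib import Path

def install_dotfiles(root: Path, dotfiles: list[tuple[str, str]]) -> None:
    # On kernel startup, write each stored dotfile to its target location
    # under the container root, creating parent directories as needed.
    for rel_path, content in dotfiles:
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

# Demo against a temporary directory standing in for the container root.
root = Path(tempfile.mkdtemp())
install_dotfiles(root, [
    ("etc/profile", "export EDITOR=vim\n"),       # global (domain/group) file
    ("home/work/.bashrc", "alias ll='ls -l'\n"),  # per-user file
])
```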
Github Link - Issue, PR (manager), PR (agent), PR (client-py)
Service Port Validation
Some apps, like Jupyter Notebook, are opened inside the Backend.AI container, and for these apps we may need to open separate ports. Users can add an app name and a port on the configuration page to designate service ports. One port can serve only a single service, and the user-provided port number must be valid; hence, I added duplicate and format checks for the ports. When a kernel is started, it runs through these checks before registering the ports, and an invalid port throws an error. I also updated the UI so that users are prevented from saving erroneous port input and are notified of the problem.
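The checks amount to something like the following sketch; the function name and the allowed range are my assumptions, not the exact Backend.AI rules.

```python
def validate_service_ports(ports: list[tuple[str, int]]) -> None:
    # Each user-supplied port must be a valid, unprivileged port number,
    # and no port may be assigned to more than one service.
    seen = set()
    for name, port in ports:
        if not isinstance(port, int) or not (1024 <= port <= 65535):
            raise ValueError(f"invalid port for {name}: {port!r}")
        if port in seen:
            raise ValueError(f"duplicate port: {port}")
        seen.add(port)

validate_service_ports([("jupyter", 8888), ("tensorboard", 6006)])  # passes
```

Running the same check in both the UI and the kernel-start path means bad input is caught early for users while the backend still defends itself against malformed requests.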
Github Link - PR (manager), PR (agent), PR (console), PR (common)
Last Remark
The Lablup internship was fun and challenging. I had never used most of the technologies powering Backend.AI, and learning by implementing was exciting. I was thrilled whenever I found a solution after reading through the existing code for hours. As I worked on tasks across multiple components of Backend.AI (manager, agent, console, client, install script, and storage proxy), I got a better grasp of how everything fits together and experienced a wide spectrum of software engineering.
The three founding members of Lablup are extremely competent. They answered my questions thoroughly and pointed me to useful resources. I was also mesmerized by their passion - one member committed code at 2:30 am and then showed up at the all-hands meeting the next morning. This internship was an unforgettable experience: it gave me an idea of what it is like to work at a technology-based startup, and the desire to work at one. The overall atmosphere and work ethic of the company set a higher bar for me.