Your guide to what's next.
Nov 4, 2020

How a custom solution helps Facebook's engineers discover the data they need

The story of Nemo, Facebook's internal data discovery engine.

Large organisations like Facebook are "packed to the gills" with various types of data artefacts: tables that store raw data, AI data sets, dashboards, and many other resources. As these companies continue to grow, so does the distance (physical and organisational) between the teams that create the data and the teams that need to be able to find and consume it.

When it is hard to find the most relevant and accurate information, it is hard to make an informed decision and take action. This is a common challenge faced by companies of all sizes around the world. To address it, some organisations turn to off-the-shelf data management solutions, while others build and maintain their own custom search systems that facilitate internal data discovery at scale. One such system, Facebook's Nemo, is designed around the intricacies of the company's vast data landscape.

The search engine for data

While most of Nemo's implementation details remain unpublished, the platform description published on the company's engineering blog does shed some light on its overall architecture.

At a high level, Nemo consists of two main components: indexing and serving. Its primary search backend is Unicorn, an inverted-index system that is also used for many other projects at Facebook, including the social graph itself. Unicorn replaces the Elasticsearch search engine used by Nemo's predecessor. That older data discovery solution only supported plaintext-based search and could not keep up with the growing amounts of data while maintaining the quality of search results.
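To make the indexing/serving split concrete, here is a minimal sketch of an inverted index in Python. It is not Unicorn or Nemo code: the artefact names, descriptions, and the `build_index` and `search` helpers are all hypothetical, and the whitespace tokenizer is deliberately simplistic.

```python
from collections import defaultdict

# Hypothetical catalogue of data artefacts: id -> free-text description.
ARTEFACTS = {
    "table:daily_active_users": "daily active users per app",
    "table:weekly_active_users": "weekly active users per app",
    "dashboard:ads_revenue": "ads revenue dashboard per region",
}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def build_index(artefacts):
    """Indexing step: map each term to the set of artefact ids containing it."""
    index = defaultdict(set)
    for artefact_id, description in artefacts.items():
        for term in tokenize(description):
            index[term].add(artefact_id)
    return index


def search(index, query):
    """Serving step: return the ids of artefacts that contain every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_index(ARTEFACTS)
print(search(index, "weekly active users"))
```

The index is built once, ahead of time, while `search` only intersects precomputed posting sets. That separation is what allows a serving layer to stay fast as the catalogue grows.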

As with all search engines, an important part of Nemo's implementation is its ranking system. Nemo's ranking process incorporates several signals that reflect the properties of the indexed data artefacts, such as recency (freshness of the data), quality (how likely it is that the result is a reliable source of data), and usage (how often the table has been accessed over the past month). It also takes into account the user's role within the company, which is used to return more personalised and therefore more relevant search results.
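One simple way to combine such signals is a weighted sum with a role-based boost. The sketch below illustrates that idea only. The weights, the boost factor, the recency formula, the `primary_role` field, and the sample artefacts are all assumptions made for the example, since Nemo's actual ranking model has not been published.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Artefact:
    name: str
    last_updated: date     # recency signal
    quality: float         # 0.0-1.0, hypothetical reliability estimate
    monthly_accesses: int  # usage signal
    primary_role: str      # hypothetical: the role that mostly uses this artefact


# Hypothetical weights and boost; the real values are not public.
WEIGHTS = {"recency": 0.3, "quality": 0.4, "usage": 0.3}
ROLE_BOOST = 1.25


def score(artefact, today, user_role, max_accesses):
    """Weighted sum of the three signals, boosted when the role matches."""
    age_days = (today - artefact.last_updated).days
    recency = 1.0 / (1.0 + age_days / 30.0)  # decays as the data gets older
    usage = artefact.monthly_accesses / max_accesses if max_accesses else 0.0
    base = (WEIGHTS["recency"] * recency
            + WEIGHTS["quality"] * artefact.quality
            + WEIGHTS["usage"] * usage)
    return base * (ROLE_BOOST if artefact.primary_role == user_role else 1.0)


def rank(artefacts, today, user_role):
    """Return the artefacts ordered from most to least relevant for this user."""
    max_accesses = max((a.monthly_accesses for a in artefacts), default=0)
    return sorted(artefacts,
                  key=lambda a: score(a, today, user_role, max_accesses),
                  reverse=True)


today = date(2020, 11, 4)
items = [
    Artefact("ads_revenue_raw", date(2020, 11, 1), 0.9, 500, "data_engineer"),
    Artefact("ads_revenue_legacy", date(2019, 11, 1), 0.4, 50, "data_engineer"),
    Artefact("ads_overview_dashboard", date(2020, 10, 20), 0.9, 450, "product_manager"),
]
for role in ("data_engineer", "product_manager"):
    print(role, [a.name for a in rank(items, today, role)])
```

With this sample data, the same candidate set is ordered differently for a data engineer and a product manager. That is the effect personalised ranking is meant to achieve.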

Additionally, searches can be phrased as natural language queries, e.g. "How many weekly active users are there on Instagram?". Such queries are parsed by a spaCy-based NLP library and answered by pointing to the tables that contain the relevant data. This follows one of the recent trends in search: the shift towards fulfilling the user's intent rather than simply finding keyword matches.
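The general idea of turning a question into table suggestions can be sketched as follows. This example does not use spaCy and does not reflect Nemo's actual parser. It swaps in a hand-written stopword list and simple keyword overlap, and the table names and keyword sets are hypothetical.

```python
import re

# Hand-picked stopwords for this sketch; a real NLP library would do far more.
STOPWORDS = {"how", "many", "are", "there", "on", "the", "a", "an",
             "is", "of", "in", "what", "do", "we", "have"}

# Hypothetical tables, each described by a set of keywords.
TABLES = {
    "instagram_weekly_active_users": {"instagram", "weekly", "active", "users"},
    "facebook_daily_active_users": {"facebook", "daily", "active", "users"},
    "ads_revenue": {"ads", "revenue"},
}


def extract_terms(question):
    """Lowercase the question, split it into words, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]


def suggest_tables(question, tables):
    """Return tables sharing at least one term with the question, best match first."""
    terms = set(extract_terms(question))
    scored = [(len(terms & keywords), name) for name, keywords in tables.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in ranked if overlap > 0]


question = "How many weekly active users are there on Instagram?"
print(extract_terms(question))
print(suggest_tables(question, TABLES))
```

The answer to the question is a pointer to the tables most likely to contain the data, not the number itself, which matches the behaviour described above.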

Bottom line

A sophisticated data discovery engine, Nemo makes sure that the right data is quickly put in the right hands and supports the decision-making process and analysis performed by Facebook's data engineers, product managers, production engineers, and other users. It incorporates a variety of search signals to surface the most relevant, accurate, recent, and trusted results, and thus promotes data health and trustworthiness across the organisation.

Copyright © 2021 Eckher. Various trademarks held by their respective owners.