Understanding Search

“Understanding Search” is a series that I wrote while I worked at Quid. It explores Search as a technical discipline from a layperson’s perspective.

These posts are written for a non-engineering audience by a non-engineer author—with, naturally, an incredible amount of research and support from engineers who specialize in search technologies. They explain aspects of how Search works—both in the context of Quid and as a standalone academic discipline—so the broadest possible audience can understand the implications of future Search-related changes to our current product.

These posts are not written (primarily) for a highly technical audience. They take certain liberties, including omissions and oversimplifications, for the sake of the intended audience’s understanding. These liberties have been vetted thoroughly with search-focused engineers.

These posts also don’t pretend to cover even a fraction of the collective knowledge available about the academic discipline of Search. The topics covered have been explicitly chosen to illuminate the past, present, and future of Search at Quid; even with regard to Search at Quid, they are admittedly non-exhaustive.

 

Understanding the Data Acquisition Process

For Quid’s News and Blogs Dataset


Overview

Before news articles can be searched and visualized in Quid, they need to be ingested and formatted to fit a certain structure that makes them usable across the Quid pipeline. This post explains the data acquisition process for Quid’s News and Blogs dataset, which includes the sourcing and structuring of news data for system-wide use.

Figure 1.0: Quid Data Acquisition Process Diagram (as of July 18, 2017)

 

Step-by-Step Process Walkthrough

  1. The acquisition process starts with the Fetcher, which fetches documents (a.k.a. “articles”) from the Moreover [1] API in the form of raw XML files. (For the curious, a rough code sketch of this step appears after this list.)
     
  2. We create and store a backup of each of these raw XML files.
     
  3. The documents are put into a Kafka topic [2], which is basically a container that holds and enqueues the documents for processing. This first one is called “Kafka Fetched” because it holds all the documents after the fetching phase.
     
  4. The Parser consumes messages in order from the Kafka Fetched topic and handles parsing the raw XML.
     
  5. During parsing, we take the raw XML file and pull out all the fields that Moreover provides: this includes the article title and body text, as well as other metadata such as source, published date, and source URL. Then we reorganize that information to fit a specific, pre-defined structure (schema). (The parsing sketch after this list shows what this reshaping looks like in code.)
     
  6. The output of parsing is a structured (parsed) JSON file [3]. Every document Quid ingests is parsed into this same structure.
     
  7. The parsed JSONs are put into the Kafka Parsed topic, where they are enqueued for Annotations.
     
  8. During the Annotations phase, we enrich the document with additional fields, called annotations. The Annotator calls out to the annotations service with a piece of text and in return receives the extracted entities and keywords. (The annotation sketch after this list shows what this call-and-merge step looks like.)
     
  9. The annotated JSON files are put into the Kafka Annotated topic and enqueued for further use. This is, in essence, the end of the data acquisition phase.
     
  10. This final structured, annotated JSON file is called the “News Article” [4]. This structure is the same across all documents Quid uses, and these structured documents can be used across the Quid pipeline. Depending on where a document is being used in the pipeline, it undergoes further transformations. (We’ll explore some of these transformations in subsequent posts.)
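
For readers who want to peek under the hood, here is a minimal Python sketch of steps 1 and 2: fetching a batch of raw XML and backing it up. The endpoint URL, API key, and file paths are invented for illustration; the real Fetcher and the Moreover API differ in the details.

```python
import datetime
import pathlib

import requests  # assumed HTTP client; the real Fetcher may use something else

# Hypothetical endpoint and credentials: stand-ins, not the real Moreover API details.
FEED_URL = "https://news-aggregator.example.com/api/articles"
API_KEY = "YOUR_API_KEY"
BACKUP_DIR = pathlib.Path("/data/backups/raw_xml")  # illustrative backup location


def fetch_and_back_up() -> str:
    """Step 1: fetch a batch of articles as raw XML. Step 2: store a backup copy."""
    response = requests.get(
        FEED_URL, params={"key": API_KEY, "format": "xml"}, timeout=30
    )
    response.raise_for_status()
    raw_xml = response.text

    # Keep the raw XML exactly as received, named by fetch time, so it can be replayed later.
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")
    (BACKUP_DIR / f"batch_{stamp}.xml").write_text(raw_xml, encoding="utf-8")

    return raw_xml
```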
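
Steps 5 and 6, the reshaping of XML fields into one consistent structure, might look roughly like the sketch below. It uses Python’s built-in XML parser, and the tag and field names are assumptions made for illustration rather than Moreover’s actual schema.

```python
import xml.etree.ElementTree as ET


def parse_article(raw_xml: str) -> dict:
    """Pull the fields out of one article's XML and fit them to a fixed schema.

    The tag names below (<title>, <content>, ...) are hypothetical; the real
    feed uses Moreover's own element names.
    """
    root = ET.fromstring(raw_xml)

    # The same labeled fields exist for every document, whether or not the
    # source happened to provide a value for each one.
    return {
        "title": root.findtext("title", default=""),
        "body": root.findtext("content", default=""),
        "source": root.findtext("source/name", default=""),
        "published_date": root.findtext("publishedDate", default=""),
        "source_url": root.findtext("url", default=""),
        # Left empty here; filled in later, during the Annotations phase.
        "entities": [],
        "keywords": [],
    }


example = parse_article(
    "<article><title>Example</title><content>Body text.</content>"
    "<source><name>Example Wire</name></source>"
    "<publishedDate>2017-07-18</publishedDate>"
    "<url>https://example.com/story</url></article>"
)
```

Note the empty entities and keywords fields: this is the “blank template” idea from footnote [3], with some inputs filled in during parsing and others reserved for the Annotations phase.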
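
Step 8 is, at heart, one service call plus a merge. The sketch below assumes a hypothetical annotations service that accepts JSON with a "text" field and returns "entities" and "keywords" lists; the real service’s address and response format differ.

```python
import requests  # assumed HTTP client

# Hypothetical address for an annotations service; the real one differs.
ANNOTATIONS_URL = "https://annotations.example.internal/annotate"


def annotate(parsed_doc: dict) -> dict:
    """Step 8: send the article text out for annotation and merge the results in."""
    response = requests.post(
        ANNOTATIONS_URL, json={"text": parsed_doc["body"]}, timeout=30
    )
    response.raise_for_status()
    annotations = response.json()

    # Enrich the parsed document with the additional annotation fields.
    annotated_doc = dict(parsed_doc)
    annotated_doc["entities"] = annotations.get("entities", [])
    annotated_doc["keywords"] = annotations.get("keywords", [])
    return annotated_doc
```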
 

FOOTNOTES:


[1]  Moreover, part of LexisNexis, is the news aggregation service we use to source our news and blogs dataset. Fun fact: Quid gets ~500 documents from Moreover every 20 seconds.

[2]  You’ll see Kafka topics appear several more times throughout this diagram. For the sake of this explanation, you can think of a “topic” as a container. In each case, the Kafka container is named for the phase the documents it holds have just exited. Having a separate, accordingly named Kafka container for each phase in the pipeline is helpful because, if something goes wrong, it makes it easier for engineers to identify exactly where the problem is occurring. (A small code sketch of this hand-off pattern follows.)
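
For the technically inclined, the “container” idea maps onto Kafka roughly as sketched below. The sketch assumes the kafka-python client, a local broker, and invented topic names; it is meant only to show the per-phase hand-off pattern, not Quid’s actual configuration.

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # kafka-python client, assumed here

# One topic per phase, named for the phase the documents have just exited.
FETCHED_TOPIC = "fetched"  # illustrative name
PARSED_TOPIC = "parsed"    # illustrative name

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

consumer = KafkaConsumer(
    FETCHED_TOPIC,
    bootstrap_servers="localhost:9092",
    group_id="parser",
    value_deserializer=lambda raw: raw.decode("utf-8"),
)

# The Parser reads from the topic that feeds it and writes to the topic named
# after its own phase, so a stuck or malformed document is easy to localize.
for message in consumer:
    # parse_article as defined in the parsing sketch above
    # (one article's XML per message, in this sketch).
    parsed_doc = parse_article(message.value)
    producer.send(PARSED_TOPIC, value=parsed_doc)
```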

[3]  For the sake of this explanation, you can think of this parsed JSON file as a blank template with labeled input fields for specific information. Some of this information is filled in during the parsing phase (article title, body, metadata), and some will be filled in during subsequent phases (e.g. entities, keywords).

[4]  This object is technically called the “Searchable News Article.” However, there is another artifact, used by Search later in the pipeline, that shares the same name; that later artifact is what is actually being referenced when a user performs a search in Quid. Unfortunately, the Acquisition-phase artifact couldn’t be renamed in the system, so for the sake of this explanation it is called the “News Article.” This is a confusing technicality, but one worth knowing in case you hear the “Searchable News Article” mentioned.