Entity disambiguation is the task of properly identifying and linking the identity of entities mentioned in text. In an ideal world, entities would be perfectly identified every time. However, until that day comes, there will be some circumstances in which entities are not combined according to a user’s preferences.
Currently, users have no way to manually combine entities that the system failed to combine, or to delete (hide/filter out) “bad” entities. Additionally, Quid has no feedback mechanism that would allow us to learn from these combinations and deletions.
By now, this is a well-known issue for our users. Below I’ve further described the specific pain points related to this problem:
Entities (people, companies, products, etc.) that are actually the same have not been properly combined, and there is no way for a user to manually repair these combinations.
Companies: “Apple” vs. “Apple, Inc.”
People: “Mehmet Oz” and “Dr. Mehmet Oz” and “Dr. Oz”
Note: Titles and suffixes (Jr., Sr., President, Count, Dr.) are a big problem, so this happens a lot for people
Affects: primarily News and Blogs dataset
“Top* companies” or “top* people” mentions are not necessarily correct if entities are improperly combined
*We don’t currently have a “top” algorithm. “Top” is just a label for the entities at the top of the list, so interpreting them as “top” is accurate 80% of the time; there are some cases in which it’s wrong
Users have to remake bar charts in Excel, using Quid data but manually combining necessary entity values
Search results may be affected; if entities aren’t combined properly, a user may not get all the results for a given entity
Users become frustrated with, and mistrustful of, data quality
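To make the bar-chart pain point concrete, here is a minimal Python sketch of what users currently do by hand in Excel: apply a manual alias map to per-entity counts, then re-rank. The function name merge_counts, the alias map, and the sample counts are all hypothetical and for illustration only; they are not Quid's data model or real data.

```python
from collections import Counter

def merge_counts(counts, alias_to_canonical):
    """Re-aggregate entity counts after applying user-defined merges.

    `alias_to_canonical` maps a variant name to the name the user wants
    it combined under; names missing from the map are kept unchanged.
    """
    merged = Counter()
    for name, n in counts.items():
        merged[alias_to_canonical.get(name, name)] += n
    return merged

# Hypothetical counts, for illustration only.
counts = {"Apple": 40, "Apple, Inc.": 25, "Microsoft": 50, "Dr. Oz": 5}
aliases = {"Apple, Inc.": "Apple"}  # one manual merge made by the user

merged = merge_counts(counts, aliases)
top = merged.most_common(2)
```

With these made-up numbers, “Microsoft” leads before the merge and “Apple” leads after it, which is how un-combined entities can make a “top” list misleading.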
Some entities are just “bad.” Currently, the only way to remove things is to filter them out; there is no deleting.
Quid did a competitive analysis of several airlines, and there were certain things they just didn’t care about and wanted to easily delete/hide/filter out
“Reuters” is marked as an entity but it was actually the republisher of the article, not the topic
A company is identified as a person (or vice versa)
An acronym is mapped to the wrong entity (e.g. “IoT” as “Institute of Transportation” instead of “Internet of Things”)
A phrase is treated as a single entity (e.g. “Michael Douglas Barack Obama” from a list of people)
The statistical method failed (a sentence like “In France, democracy was founded in 1789” could trick the algorithm into marking “democracy” as a company)
Affects: overall data quality
The article is the filtered unit, not the entity. Filtering out entities actually filters out the entire article the entity is associated with, so if other entities are also associated with that article, they are affected.
The article may still contain relevant/accurate entities that just weren’t extracted as the “top” one.
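The difference between filtering at the article level and hiding at the entity level can be sketched as follows. This is only an illustration under assumed names: the dictionary shape, the functions filter_articles and hide_entity, and the sample articles are hypothetical and do not reflect how Quid stores articles.

```python
def filter_articles(articles, entity):
    """Behavior described above: drop every article that mentions `entity`."""
    return [a for a in articles if entity not in a["entities"]]

def hide_entity(articles, entity):
    """Possible alternative: keep every article, remove only the entity's record."""
    return [
        {**a, "entities": [e for e in a["entities"] if e != entity]}
        for a in articles
    ]

# Hypothetical sample data, for illustration only.
articles = [
    {"id": 1, "entities": ["Reuters", "Delta Air Lines"]},
    {"id": 2, "entities": ["United Airlines"]},
]

filtered = filter_articles(articles, "Reuters")  # article 1 is lost entirely
hidden = hide_entity(articles, "Reuters")        # article 1 survives without "Reuters"
```

In the first case the other entity in article 1 disappears along with “Reuters”; in the second it remains available for analysis.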
There is no “learning” related to improperly combined/differentiated entities. If something was fixed once by a user, it won’t be fixed the next time; they would have to make the same combination (or separation) again. (These fixes could be “learned” on a user basis, an account/client basis, and/or a global basis)
Affects: primarily News and Blogs dataset
Implications: Users will become frustrated if they have to keep making the same corrections every time
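One way to picture “learning” at a user, account/client, or global level is a small store of remembered corrections that is consulted whenever an entity name is displayed. The sketch below is hypothetical throughout: the class CorrectionStore, its fields, and the lookup order (user first, then client, then global) are assumptions made for illustration, not a decided design.

```python
from dataclasses import dataclass, field

@dataclass
class CorrectionStore:
    """Remembers merges at user, client, and global scope (hypothetical)."""
    by_user: dict = field(default_factory=dict)    # (user_id, alias) -> canonical
    by_client: dict = field(default_factory=dict)  # (client_id, alias) -> canonical
    global_: dict = field(default_factory=dict)    # alias -> canonical

    def record(self, alias, canonical, user_id=None, client_id=None):
        """Store a merge at user scope if given, else client scope, else globally."""
        if user_id is not None:
            self.by_user[(user_id, alias)] = canonical
        elif client_id is not None:
            self.by_client[(client_id, alias)] = canonical
        else:
            self.global_[alias] = canonical

    def resolve(self, name, user_id=None, client_id=None):
        """Look up a name: user scope first, then client, then global."""
        for table, key in (
            (self.by_user, (user_id, name)),
            (self.by_client, (client_id, name)),
            (self.global_, name),
        ):
            if key in table:
                return table[key]
        return name  # no learned correction; keep the name as extracted

store = CorrectionStore()
store.record("Dr. Oz", "Mehmet Oz", user_id="u1")  # remembered for one user
store.record("Apple, Inc.", "Apple")               # remembered for everyone
```

Here user u1 would see “Dr. Oz” resolved to “Mehmet Oz” on every later visit, while other users would still see “Dr. Oz” until the fix is promoted to a wider scope.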
Crunchbase and CapIQ (our sources for the Companies dataset) may have overlapping companies that we missed when doing the initial entity combination.
Affects: primarily Companies dataset
Implications: The number of improper company combinations may be even higher than we have assumed, and there’s currently no way for a user to fix this
In Opus, unless entities are phrased exactly the same, they won’t be combined.
Affects: Opus users
Implications: Users can’t manually correct these combinations
Open Questions and Considerations
Should users be able to separate entities that they combine? How?
How will users know which entities are the result of a manual combination? How will they know which entities were combined to produce the resulting entity?
At what point in the Quid workflow should we allow / prompt users to start assessing and combining entities?
At what level should users be able to combine entities?
Globally, always per user
Globally, always per client
Globally, always for all users
When are users after a categorization change (e.g. subtypes like “tv personality” vs. “business person” vs. “politician” for Donald Trump) vs. a full merger?
We currently only expose some entity “classifications” that Alchemy provides—People, Company, Institution, Location—but there are others available, which could be highly relevant depending on the use case—Field Terminology, Drugs, Health Conditions, Products. How can the classifications we show evolve to support various use cases?
If a user were to delete a node, it would affect all the entities contained in that node (article). Since the node/article is currently the focal point of activity, if a user were to filter or delete an entity*, they’d be filtering or deleting the whole node/article. Is there a way to “delete the entity” by deleting the record of mention or significance of the entity from the node/article, rather than deleting the whole node/article itself, so the other entities in that node/article remain unaffected?
*A user may want to delete an entity if it is badly tagged or not relevant to the user’s analysis
Alchemy (News/Blogs, Companies, Patents) and Basis (Opus) are the services we use for entity extraction (and top keywords). They do the entity disambiguation (except in Opus; we don’t have disambiguation there yet).
Our sums (e.g. “Top Companies”) are based only on Primary Mentions, not on All Mentions. We can only access All Mentions for an entity by:
Filtering by the entity and “any mention”, BUT there’s no way to select this group
Finding the entity via search, BUT if entities aren’t de-duplicated, then it doesn’t show everything that’s applicable
For each article’s entities, we only show the top X; there are no score thresholds. So, even if entity 5 and entity 6 are scored the same, entity 6 is cut off if we only show the top 5.
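The cutoff behavior can be illustrated with a short Python sketch. The function top_x mirrors the hard cutoff described above; top_x_with_ties is one possible alternative that keeps entities tied with the last shown score. Both function names, the (entity, score) pair format, and the scores are hypothetical and chosen only for illustration.

```python
def top_x(scored, x):
    """Hard cutoff: keep the first x entities by score, ignoring ties."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:max(x, 0)]

def top_x_with_ties(scored, x):
    """Alternative: also keep any entity tied with the x-th score."""
    if x <= 0:
        return []
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if len(ranked) <= x:
        return ranked
    cutoff = ranked[x - 1][1]  # score of the last entity that makes the cut
    return [pair for pair in ranked if pair[1] >= cutoff]

# Hypothetical (entity, score) pairs; E and F are scored the same.
scored = [("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.6), ("E", 0.5), ("F", 0.5)]

hard = top_x(scored, 5)
tied = top_x_with_ties(scored, 5)
```

With the hard cutoff, F is dropped even though it scores the same as E; the tie-aware variant returns six entities instead of five.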
We only have graphs (bar charts and timelines) that show “Top” entities mentioned, not “Any”/“All” mentions of an entity because there would be duplicate representation of the same article
One potential solution for this is to make the bars of an “Any”/“All” graph unclickable, so the user can see the number of mentions but not select an article/node in the bar
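The duplicate-representation issue can be shown with a small Python sketch that counts entities both ways. The article structure is hypothetical, and treating the first listed entity as the primary mention is an assumption made only for this example.

```python
from collections import Counter

# Hypothetical articles; assume the first entity listed is the primary mention.
articles = [
    {"id": 1, "entities": ["Apple", "Microsoft"]},
    {"id": 2, "entities": ["Microsoft", "Apple"]},
    {"id": 3, "entities": ["Apple"]},
]

# "Top"-style sum: each article is counted once, under its primary entity.
primary = Counter(a["entities"][0] for a in articles)

# "Any"/"All"-style sum: each article is counted under every entity it mentions.
any_mention = Counter(e for a in articles for e in set(a["entities"]))
```

The primary counts add up to the number of articles, while the any-mention counts add up to more, because articles 1 and 2 each appear in two bars. Selecting a bar in such a chart would therefore select articles that also belong to other bars.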
MVP Design Proposal
Current design specs can be found via this Invision link: https://quid.invisionapp.com/share/M8C61O0FH