Eliminating duplicates in an ever-growing content nightmare

Imagine this: you run a marketplace website. You rely on content created by users to deliver a quality product to the general public, with minimal intervention on your side, because in the end you’re just the connection between buyer and seller. So how do you deliver the best user experience while relying entirely on the goodwill of your publishers?

This is a question we ask ourselves every day of the week (the working days at least) at Avito, Morocco’s leading classified ads website. It also ties in well with Schibsted’s mission of “Empowering people in their daily lives.”

One of the main inconveniences for our users is irrelevant ads. We don’t want our users to have to go through countless pages of the same repeated ads just to find what they’re actually looking for. Sometimes you even give up when the ideal deal was just one page away, just like when you’re searching for something on Google and find yourself reaching the dark and moist realm of what lies beyond the second page of search results.

So what do we do? How can we solve this issue with limited resources?

And then it dawned on us

We sat down for about 10 minutes and the following conversation took place:

  • Product Manager: I keep stumbling upon the same ads again and again. Maybe we should do something about it? Users are complaining of the redundancy of certain ads.
  • Engineer: Indeed! The other day, I was looking for a Raspberry Pi and there was a flood of ads for desktop computers.
  • Product Manager: …
  • Engineer: I think we should do something about this…
  • Product Manager: What if we block people from reinserting the same ads?
  • Engineer: How?
  • Product manager: I don’t want your job!
  • Engineer: I’m going to compare each newly inserted ad with the rest of the user’s ads and block it if it was a duplicate.
  • Product manager: Sounds good to me. How much time do you need?
  • Engineer: Two months.
  • Product manager: One month seems logical.
  • Engineer: How?
  • Product manager: Okay, one month it is!

The okayest okay.

So it started, the first version of the Duplicate Detection project.

Vladimir I. Levenshtein

Dr. Vladimir Levenshtein is considered the father of coding theory in Russia. His contributions are present in many aspects of our everyday lives. One of his best-known creations is the foundation of most of today’s spell-checking algorithms: the “Levenshtein distance.” This was the key to solving our problem.

What a great guy!

How can a spell checker help detect duplicates?

Levenshtein distance is a way to measure the similarity between two strings. It counts the minimum number of single-character edits (insertions, deletions, and substitutions) needed to transform string (A) into string (B). The greater the distance, the more different the texts are.

Example:

We have two strings, the initial text (I) and the final text (F). We’ll refer to the Levenshtein distance function as dt.

If I = “adil” and F = “adil”, then dt(I, F) = 0, because the transformation from I to F requires 0 steps.

If I = “adil” and F = “adim”, then dt(I, F) = 1, because the transformation from I to F requires 1 step (l => m).

This is the property we first relied on to detect when and how duplicate ads are inserted.
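For readers who want to see the mechanics, here is a minimal sketch of the distance computation in Go. The function name and the two-row dynamic-programming layout are my own choices for illustration, not Avito’s code, and it compares bytes, so multi-byte characters would need extra care in production:

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-character edits
// (insertions, deletions, substitutions) needed to turn a into b.
func levenshtein(a, b string) int {
	// prev[j] holds the distance between a[:i-1] and b[:j]; curr is the row being filled.
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // transforming "" into b[:j] takes j insertions
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i // transforming a[:i] into "" takes i deletions
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min(prev[j]+1, min(curr[j-1]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func min(x, y int) int {
	if x < y {
		return x
	}
	return y
}

func main() {
	fmt.Println(levenshtein("adil", "adil")) // 0
	fmt.Println(levenshtein("adil", "adim")) // 1
}
```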

That’s a great idea, let’s ruin it

In the following graphic, I will illustrate how the first iteration of the Duplicate Detection project was implemented.

The short explanation

Looks pretty basic, but the devil is in the details. The big box that blatantly states a check is performed is where the magic happens (or should have happened…), and it’s in there that the Levenshtein distance is put to use to solve our duplicate issue.

The long explanation

First, an ad is submitted to the ad-insert process, where the necessary checks are performed (checking for SQL injections, lol). If an ad is deemed “legit,” it is submitted to the “duplicate detector” process, in which the ad is run against the database to check whether the user is trying to republish the same ad. The check is done database-side using a stored procedure that returns a “score.” That score is used to deem the ad “duplicate” or “unique.”

Here is a heavily edited version of the stored procedure that handled it:

CREATE OR REPLACE FUNCTION score_string(i_source TEXT, i_other TEXT)
RETURNS INTEGER AS $$
DECLARE
    l_source TEXT;
    l_other TEXT;
    l_lv INTEGER;
    l_size INTEGER;
    l_max_size INTEGER;
    l_current_size INTEGER;
BEGIN
    -- Returns a value from 0 to 100, where 0 is completely different and 100 is identical
    IF (i_source = i_other) THEN
        RETURN 100;
    END IF;

    l_max_size = 255;

    -- This loop reduces the byte size of the string to 255 (a limitation of the levenshtein function).
    -- Substring can return the correct character count (length) but not the same byte size (special characters).
    l_current_size = l_max_size;
    LOOP
        l_source := substring(i_source from 0 for l_current_size);
        l_current_size = l_current_size - 1;
        EXIT WHEN octet_length(l_source) < l_max_size;
    END LOOP;

    -- Same byte-size truncation for the other string.
    l_current_size = l_max_size;
    LOOP
        l_other := substring(i_other from 0 for l_current_size);
        l_current_size = l_current_size - 1;
        EXIT WHEN octet_length(l_other) < l_max_size;
    END LOOP;

    l_size = char_length(l_source);

    l_lv := - (levenshtein(l_source, l_other) - l_size) * 100 / l_size;
    IF l_lv > 100 THEN
        RETURN 100;
    ELSE
        RETURN l_lv;
    END IF;
END
$$ LANGUAGE plpgsql;

Lovely, isn’t it? It was a good solution! I will spare you the details on how it was implemented within the huge system, and jump right into what went wrong.
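To make the scoring step concrete: the formula in the procedure simplifies to (size − distance) × 100 / size, i.e. the percentage of characters that survive the transformation. The same normalization as a Go sketch (the function name and the extra floor at 0 are my additions; the original procedure only clamps at 100):

```go
package main

import "fmt"

// similarityScore mirrors the stored procedure's normalization:
// it maps a Levenshtein distance onto a 0-100 scale, where 100
// means identical and 0 means every character had to change.
func similarityScore(distance, size int) int {
	score := -(distance - size) * 100 / size
	if score > 100 {
		return 100
	}
	if score < 0 { // extra safety floor, not in the original procedure
		return 0
	}
	return score
}

func main() {
	// "adil" vs "adim": size 4, distance 1 -> 75% similar.
	fmt.Println(similarityScore(1, 4)) // 75
	// Identical strings: distance 0 -> 100.
	fmt.Println(similarityScore(0, 4)) // 100
}
```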

(N° of users in a group * N° of ads per user * N° of ad parameters) = Timeout

So basically, to get a similarity score, the algorithm must compare the new data with the user’s existing data. However, there’s a catch …

Back in mid-2014 we had a recurring issue regarding the uniqueness of users. Because of the telecom bubble in Morocco, getting a new phone number was cheaper than a cheeseburger. Since Avito.ma imposes restrictions on the number of ads that can be posted in certain categories (users were limited by phone number or by email), some users resorted to creating multiple accounts with multiple phone numbers to post the maximum number of ads. So a user could buy three “throwaway” phone numbers, create three accounts, and post three times the allowed number of ads.

The solution we came up with was a new concept called “User groups.” An algorithm uses “certain” parameters to determine whether or not you are the same user behind different accounts, and places each account in its corresponding group. We let the groups grow on their own, with no moderation or human intervention. This proved to be quite handy.

Fast forward to the implementation of the first duplicate detection project. Things ran smoothly and we revelled in watching the number of duplicates drop. We even built new daring projects on top of the duplicate trap. Little did we know that there was a great danger lurking in the shadows of the Database.

Avito.ma is one of the fastest growing websites in Morocco, which meant that whatever code we wrote in 2014 would probably have to be optimized again in 2015, 2016 etc. This was the trap we fell right into. The User groups logic yielded some great results, but also resulted in one of the most dangerous situations we have ever encountered as a tech team: CONSTANT DATABASE TIMEOUTS.

It started in early 2016. We noticed that database queries were taking a bit more time than usual. We snooped around and realized it was the duplicate check that was taking too long. Some more snooping and we figured out something really scary. The grouping algorithm went WILD, and instead of grouping accounts it started grouping groups of accounts. So instead of adding one single account to a group, it added all the accounts in the same group to a new group. We ended up with groups with up to 500 users and some with more than 30,000 ads per group.

Knowing that the duplicate detection system compares all ad data with stored data, and assuming an ad has about 6 parameters (subject, body, price, address, etc.), we could face something like 30,000 * 6 = 180,000 comparison operations when inserting a single ad. Multiply that by the insane number of ads posted daily, and we ended up with a crazy number of operations per day just to ensure the uniqueness of ads. Something had to be done, quickly.

The incredible super mega engineering task force

The CTO called for a task force to be assembled after the day the database gave in and decided to time out all requests. We were literally removed from our desks and put on a different floor to fulfil one task: GET THINGS WORKING AGAIN. Initially we were three people: one super mega devops (AKA the real Wonder Woman), one insane backend developer (actually insane), and me. We started contemplating different solutions to the multiple issues we were having. How could we propose a magic solution that would get us out of this mess? If only there were a system where we could perform rapid comparison operations, and where all the necessary data could be stored and retrieved immediately – something completely detached from the database.

Photo taken the day of the assembly.

Elasticsearch

According to the official website:

“Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.”

We studied the idea of implementing Elasticsearch (ES) as a duplicate detection system for about two weeks. Quick POCs, boring talks, bitter realizations and fruitless meetings. Then we decided we were going all in with this suggestion.

The project itself was divided into multiple sub-projects. Each of us led a specific aspect of the implementation:

  • Setting up the Elasticsearch cluster and all infra/devops operations (Dounia Alla)
  • Indexing data in Elasticsearch (Adil Houdhoud)
  • Using Elasticsearch as a duplicate detection engine (Hamza Aid)

My task was to ensure that all needed data were indexed properly in ES without affecting database performance.

Indexing data

The first step of indexing the data was to determine which data to index. For the sake of duplicate detection alone, we could have indexed just the parameters needed for the check, but that would have wasted the potential of this project. So while the project was sold to management and other stakeholders as a duplicate detection engine, we were secretly working on indexing EVERYTHING we had. For this purpose we needed a good indexing strategy.

The different players

Here’s an overview of the deployed solution:

Database

A Postgres database, nothing special. But we had a secret weapon that facilitated the indexing process: we kept track of all actions performed on ads in separate tables, and each ad change had its own unique id, timestamp, etc., which meant we could easily fully index the ads and update (reindex) changed ads (edits, deletes, deactivations, etc.) in Elasticsearch.

Notify

Postgres provides a cool feature that allows sending notifications when certain events are triggered. From the official NOTIFY documentation:

“The NOTIFY command sends a notification event together with an optional “payload” string to each client application that has previously executed LISTEN channel for the specified channel name in the current database.”

We set up some triggers to track all changes to data, and created a function that gets called every time an action is performed on an ad.

This is an example of an insert:

IF (TG_OP = 'INSERT') THEN
    PERFORM *
    FROM index_tracking
    WHERE src_id = NEW.id;

    IF FOUND THEN
        DELETE FROM index_tracking WHERE src_id = NEW.id;
    ELSE
        INSERT INTO index_tracking (src_id, last_modified, present)
        VALUES (NEW.id, CURRENT_TIMESTAMP, true);
    END IF;

    PERFORM pg_notify('notification_channel', format('%s %s %s', TG_OP, NEW.id, TG_TABLE_NAME));
END IF;
  • index_tracking: A table we use to track all indexing actions.
  • NEW.id: The ID of the newly inserted ad.
  • 'notification_channel': The name of the channel the indexing daemon listens on, used by the pg_notify function.
  • TG_TABLE_NAME: The name of the table that triggered the event.
  • TG_OP: The name of the operation performed (INSERT, UPDATE, etc.).

Indexing Daemon

This is a service dedicated to indexing (and reindexing) data. It relies mainly on the NOTIFY feature of Postgres: it listens on the 'notification_channel' we set in the trigger.

When “notified,” it retrieves the ad’s data from the database using the id sent in the notification, then sends the new data to Elasticsearch for indexing.

It works in two modes, batch and realtime. In batch mode, it gathers the IDs sent in notifications over a specific period of time (usually 10 to 20 seconds) before selecting the data from the DB to be indexed. Realtime mode, on the other hand, is realtime.

Elasticsearch cluster

We’re using Amazon’s Elasticsearch cluster service. It’s reliable and spares us a lot of the hassle of maintaining and configuring the service.

Duplicate Detection microservice

Nicknamed OWL, this is a REST microservice written in Go (by Hamza Aid) that qualifies duplicates in two phases. First, it gets possible similar ads for a given ad from Elasticsearch using Lucene’s moreLikeThis function. Then, it runs the results against the original ad’s info using a Levenshtein distance function.

We also experimented with some Jaro-Winkler-based algorithms (props to Soufiane Mghanen), but at the moment Levenshtein is our favorite.

Here is a snippet of the Jaro-Winkler implementation we’re experimenting with; more at https://github.com/xrash/smetrics :

func JaroWinkler(a, b string) float64 {
	la := float64(len(a))
	lb := float64(len(b))

	// match range = max(len(a), len(b)) / 2 - 1
	matchRange := int(math.Floor(math.Max(la, lb)/2.0)) - 1
	matchRange = int(math.Max(0, float64(matchRange-1)))
	var matches, halfs float64
	transposed := make([]bool, len(b))

	for i := 0; i < len(a); i++ {
		start := int(math.Max(0, float64(i-matchRange)))
		end := int(math.Min(lb-1, float64(i+matchRange)))

		for j := start; j <= end; j++ {
			if transposed[j] {
				continue
			}

			if a[i] == b[j] {
				if i != j {
					halfs++
				}
				matches++
				transposed[j] = true
				break
			}
		}
	}

	if matches == 0 {
		return 0
	}

	transposes := math.Floor(halfs / 2)

	return ((matches / la) + (matches / lb) + (matches-transposes)/matches) / 3.0
}

And then?

Here we have it: some fancy algorithms along with data indexed in Elasticsearch. How is this system exploited? And how does it fit within our existing workflow?

The CTO and the Product Managers cheering for us.

Frontend? Backend?

This duplicate detection, or DD for short (too late for that now), is implemented in multiple places within our platform(s). I’ll highlight the implementation in two strategic places:

Frontend detection

The first line of defense. We call it internally “frontend detection,” but in reality it all happens in the backend. It is triggered when an ad is inserted: right after the user clicks the “submit” button, a query is sent to the duplicate detection microservice (nicknamed OWL) to get a score. Depending on this score, we either accept the ad, or reject it and tell the user that she already has a similar ad lurking somewhere (we even point her to the original ad), which she can then either deactivate or renew.

The duplicates queue

All ads are moderated, some automatically and some manually. For this reason, we have set up a duplicate detection module within the moderation system. The DD module crawls the inserted ads for duplicates and flags each one accordingly. There are two flags:

  • Black flag: the ad is definitely a duplicate; it is rejected automatically.
  • Potential duplicate flag: the ad might be a duplicate; it is moderated manually and it’s up to the moderator (a human being) to judge.

An ad that is not a duplicate is simply not flagged.
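The flagging decision itself is just a threshold test on OWL’s score. A sketch in Go, where the cutoff values 95 and 70 are hypothetical placeholders, not our tuned thresholds:

```go
package main

import "fmt"

// Flag represents the moderation outcome for an ad.
type Flag string

const (
	BlackFlag     Flag = "black"     // certain duplicate: auto-reject
	PotentialFlag Flag = "potential" // maybe a duplicate: manual review
	NoFlag        Flag = "none"      // unique: not flagged at all
)

// classify maps a 0-100 similarity score to a moderation flag.
// The thresholds here are illustrative only.
func classify(score int) Flag {
	switch {
	case score >= 95:
		return BlackFlag
	case score >= 70:
		return PotentialFlag
	default:
		return NoFlag
	}
}

func main() {
	fmt.Println(classify(100)) // black
	fmt.Println(classify(80))  // potential
	fmt.Println(classify(12))  // none
}
```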

Leap of faith

The entire project was a leap of faith. We didn’t have much time to contemplate all the possible solutions, so we put all possible faith into this project. Boy oh boy, did it pay off.

Database impact

We went from constant 100% CPU usage and non-stop timeouts to an average of < 30% at peak hours. The impact was instant: within minutes the drop was visible, and virtual champagne was popped.

Grape juice.

Impact on the content

The number of complaints we received after the deployment of the DD project was just amazing to watch (not from the customer support perspective, of course). Users who were accustomed to spamming the website with duplicates were furious about the “new” change. The CS team and the Content Product Manager did a great job raising awareness about the importance of clean content on the website and how it benefits both buyers and sellers.

Here are some aspects regarding the impact of the DD project on content:

  • A 70% drop in duplicates within a one-year period.
  • Improved manual moderation time: time previously spent moderating duplicates is now spent improving the quality of the content.
  • A slight drop in content volume but a huge improvement in content quality.

And that’s it!

Product Managers rejoicing after impact study was published.

All in all, the project was ACE! It positively impacted content, moderation, QA, revenues and many other aspects of our daily lives at Avito.ma. It solidified our position as the leading classifieds website in Morocco (by far) in both content and quality, and gave the tech/product team a window to work on something new and experiment with cutting-edge technology.