Fixing Python SOAP handling by using XML and JSONB fields in PostgreSQL

Synchronizing a huge amount of data between two systems can be problematic, especially if one of the systems is accessed through a slow API.

The service we had to integrate with was exposed through a SOAP API. The API itself was slow, with a lag of roughly 4.5 seconds per 10 records, but there was a bigger problem – Python SOAP libraries are really, really slow, e.g. 20+ seconds per request just to process the response.

With 500,000 records, it would take approximately a week to fetch all of the data, so we had to find a way to speed up the synchronization.

Here are the details of how we found a suitable solution to this problem.

Attempt 1: Finding a suitable Python library

The first idea, as mentioned earlier, was to find a suitable Python library. We found the suds library, which transforms SOAP responses into Python objects. Unfortunately, suds had last been updated five years earlier. After a thorough search, we found suds-jurko, a fork of the original suds. In theory, the module queries the server and returns the objects, which we would then map to our database. In practice, the slow synchronization persisted: the module needed 20 seconds to parse a single API response. Add the 4.5 seconds on the server side and you get almost 25 seconds per 10 records. We wanted faster synchronization, so we abandoned the idea of using the suds library.

Attempt 2: Implementing a Java wrapper

The second approach was in some way similar to the first – we could utilise Java tooling and generate a whole POJO client. IntelliJ IDEA was very helpful with that – it took 20 minutes to generate a simple proof of concept. But it would have added another layer of abstraction and technology to the system, so we abandoned this approach as well.

Attempt 3: Dump all XML responses and work with them locally

Finally, we decided to store all API responses (raw XML) exactly the way the SOAP API delivered them, and to think about the import method later.

The data dump took roughly 35 hours; for some reason, some of the data had two endpoints on the SOAP server. In the end, we had all the raw XML documents in our database (PostgreSQL with its native XML type). All we had to do was come up with a mapping and an importer.
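A staging table for such a dump can be very simple. This is a sketch, not the original schema – the table and column names are assumptions:

```sql
-- hypothetical staging table; names are illustrative
create table raw_responses (
    id      serial primary key,
    fetched timestamptz not null default now(),
    payload xml not null  -- PostgreSQL checks well-formedness on insert
);
```

Using the native XML type (rather than text) means malformed responses are rejected at insert time, before they can break the importer later.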

The solution

  1. We used the Python xmltodict library to transform the XML documents into Python dictionaries.

  2. Then, each newly formed Python dictionary was stored in a JSONB field in PostgreSQL.
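The transformation in step 1 can be illustrated with the standard library alone. xmltodict is a third-party package, so the sketch below uses a minimal stand-in built on xml.etree.ElementTree that mimics the shape of xmltodict.parse(); the sample XML structure is invented for illustration:

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively convert an ElementTree element into a dict,
    roughly mimicking the output shape of xmltodict.parse()."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:  # repeated sibling tags become a list
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

# A toy SOAP-like response body (structure is illustrative only)
xml = """
<order>
  <client>
    <delivery_address>
      <city>Szczecin</city>
    </delivery_address>
  </client>
</order>
"""

data = element_to_dict(ET.fromstring(xml))
print(json.dumps(data))  # this JSON string is what lands in the JSONB column
```

The resulting JSON string can be inserted directly into a JSONB column with any PostgreSQL driver.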

It seems we hit the bullseye. It turns out PostgreSQL is quite good at filtering on nested JSONB fields:

Before index:

syncDB> select count(*) from core_jbjorder where data -> 'client' -> 'delivery_address' ->> 'city' = 'Szczecin';  
| count  |
| 198762 |
Time: 3.279s

After index:

syncDB> create index on core_jbjorder ((data -> 'client' -> 'delivery_address' ->> 'city'));
Time: 3.910s  
syncDB> select count(*) from core_jbjorder where data -> 'client' -> 'delivery_address' ->> 'city' = 'Szczecin';  
| count  |
| 198762 |
Time: 0.018s

The use of JSONB allowed us to gradually transfer our data from the document-like JSONB field into a good old relational schema.
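That gradual transfer can be plain SQL selecting out of the JSONB column. A sketch of one such step – the target table and its columns are assumptions, not the original schema:

```sql
-- hypothetical relational target; names are illustrative
insert into core_order (source_id, delivery_city)
select id,
       data -> 'client' -> 'delivery_address' ->> 'city'
from core_jbjorder;
```

Each entity can be migrated this way in its own pass, while the raw JSONB stays available as the source of truth until the relational schema is complete.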

Lessons learned

Splitting the responsibility for data fetching and data transformation was a good decision. The first step was to store all of the raw XML before we even had an import implementation – it is always easier to work with data locally. The second was the application of the JSONB field in PostgreSQL.

So a general cue for tackling such problems is to divide them into smaller problems and attack those one by one. This way you can focus on simple solutions and engage more team members in the creative process of problem solving!

Source: TEONITE