I have done a couple of PIM projects based on InRiver PIM, where I implemented both inbound and outbound connectors. The inbound connectors would import data from XML files and parse metadata from images. The outbound connectors would export catalog data to systems like EPiServer Commerce, Demandware (now Salesforce Commerce Cloud) as well as Azure Blob Storage (in a custom data format).
From that experience I have learned a lot about implementing flexible, generic and efficient data processing, and I have been able to put many of the performance tricks I picked up in the past to good use.
In this post I would like to share these six main tips:
- Make mapping of fields configurable
- Outbound connectors are stateless listeners
- Load entities in bulk
- Adhere to a schema
- Serialize data by enumeration
- There is no local file system in iPMC
These tips apply not only to InRiver outbound connectors. I also applied a few of them when I implemented a catalog exporter from EPiServer Commerce.
Make mapping of fields configurable
An outbound connector will often act as a converter from one structure of entities and fields to another structure. If it was a connector for EPiServer Commerce, it could take the value of a ProductName field and output it to a field called Name.
Configuration of the whole connector should not be hardcoded, nor should it be configured using a whole lot of individual extension settings.
Which languages to export, which fields to use for custom and system fields, which image sizes to export and so on could all be managed in an XML mapping document. Every method that constructs a data structure from InRiver entities and fields can then look up the configuration in this mapping document.
In a recent solution, where I built a complete outbound connector for a customer, I created such a mapping document, along with an XSD schema, and had C# classes autogenerated. In the connector, I then preloaded that mapping document, and cached it for the lifetime of each invocation of the connector.
Whenever the client wished to add a field to the output, chances were high that the field could just be added to the mapping document. Not a whole lot of effort.
However, there is a small catch. We can deploy only assembly files to the iPMC. Because of this, the XML document needs to be included in the assembly as a compiled resource. Alternatively, it can be stored as an extension setting, which will then automatically be injected into the extension.
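To make the idea concrete, here is a minimal sketch of loading such a mapping document from a string, for instance the value of an extension setting. The element names and the classes ConnectorMapping, FieldMapping and MappingLoader are made up for illustration; in a real connector the classes would rather be generated from the XSD.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Serialization;

// Hypothetical mapping model (illustration only, not generated from a real XSD).
[XmlRoot("Mapping")]
public class ConnectorMapping
{
    [XmlArray("Languages"), XmlArrayItem("Language")]
    public List<string> Languages { get; set; } = new List<string>();

    [XmlArray("Fields"), XmlArrayItem("Field")]
    public List<FieldMapping> Fields { get; set; } = new List<FieldMapping>();

    // Returns the destination field name for a source field, or null when unmapped.
    public string ResolveTarget(string sourceField) =>
        Fields.FirstOrDefault(f => f.Source == sourceField)?.Target;
}

public class FieldMapping
{
    [XmlAttribute("source")] public string Source { get; set; }
    [XmlAttribute("target")] public string Target { get; set; }
}

public static class MappingLoader
{
    // Parses the mapping XML, e.g. the value of an extension setting
    // or the content of a resource embedded in the assembly.
    public static ConnectorMapping Load(string xml)
    {
        var serializer = new XmlSerializer(typeof(ConnectorMapping));
        using (var reader = new StringReader(xml))
        {
            return (ConnectorMapping)serializer.Deserialize(reader);
        }
    }
}
```

With a mapping that contains a Field element with source="ProductName" and target="Name", ResolveTarget("ProductName") gives "Name". Loading the document once per invocation and passing the resulting object around keeps the lookups cheap.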
Outbound connectors are stateless listeners
In past versions of InRiver a typical outbound connector would have its own state. It would start up and keep running until stopped. While running, it could keep data in its own variables.
In iPMC we can build an outbound connector by implementing the IChannelListener interface. There are also other listener interfaces, which can be useful for various other purposes.
All listeners are stateless, by definition. They cannot be started or stopped, and they are invoked only when one of the events is triggered. Because of this, every action should be able to run in isolation, all settings should be quickly loaded and the implemented actions should execute fast.
The outbound connector will receive events for every single update to every single entity in a channel, and unless the destination system accepts individual incremental changes, the connector needs to queue up incremental changes and export them in bulk at certain intervals.
As a connector was stateful in the past, it could queue single changes in an internal queue list. Because an iPMC connector has no state, we can do something like this:
- On full publish, the connector loads and exports everything at once (no queuing).
- On all other events, the connector records the change in the ConnectorState API.
- A ScheduledExtension will run at predefined intervals. It will take all the recorded changes from the ConnectorState, resolve all entities that are affected by the changes and then export them.
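The record-then-flush pattern in the list above can be sketched roughly like this. IChangeStore is a hypothetical stand-in for the persistence offered by the ConnectorState API, and InMemoryChangeStore exists only to make the sketch runnable; a real listener cannot keep changes in memory between invocations. ChannelEventHandler and the export delegate are illustrative names as well.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the store behind the ConnectorState API.
public interface IChangeStore
{
    void Record(int channelId, int entityId);
    // Returns all recorded changes and removes them from the store.
    IReadOnlyList<(int ChannelId, int EntityId)> TakeAll();
}

// In-memory implementation for illustration only.
public class InMemoryChangeStore : IChangeStore
{
    private readonly List<(int ChannelId, int EntityId)> _changes =
        new List<(int ChannelId, int EntityId)>();

    public void Record(int channelId, int entityId) => _changes.Add((channelId, entityId));

    public IReadOnlyList<(int ChannelId, int EntityId)> TakeAll()
    {
        var copy = _changes.ToList();
        _changes.Clear();
        return copy;
    }
}

public class ChannelEventHandler
{
    private readonly IChangeStore _store;
    private readonly Action<int, IReadOnlyCollection<int>> _export; // (channelId, entityIds)

    public ChannelEventHandler(IChangeStore store, Action<int, IReadOnlyCollection<int>> export)
    {
        _store = store;
        _export = export;
    }

    // Called from the listener's entity-level events: only records the change.
    public void OnEntityChanged(int channelId, int entityId) => _store.Record(channelId, entityId);

    // Called from the scheduled extension: drains the store and exports
    // the distinct affected entities once per channel.
    public void Flush()
    {
        foreach (var group in _store.TakeAll().GroupBy(c => c.ChannelId))
        {
            var ids = group.Select(c => c.EntityId).Distinct().ToList();
            _export(group.Key, ids);
        }
    }
}
```

Grouping by channel and removing duplicate IDs means that an entity that was updated ten times between two runs is still exported only once.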
Load entities in bulk
At some point the connector needs to load entities from the InRiver Remoting API in order to export them to another format. Always load entities in bulk before iterating and transforming them. Never iterate a list of IDs to load the entities individually.
The ChannelListener gets invoked to handle each single entity event. It could be something like an entity that is added to a channel or an entity that is linked to another entity in a channel. It might seem like these events only affect individual entities that can be handled alone, but that is rarely the case.
If the affected entity is something like a channel node or a product, then the connector would typically have to include all of the entity's child entities. You would then have to resolve the IDs of the related entities to be able to export them all.
Now we have a list of entity IDs that should be exported. We could easily iterate through this list, load each entity and do the necessary transformation. However, that would lead to a lot of small calls to the API, which is not optimal.
My recommendation is to always preload all the needed entities at the beginning of the process, and store them in the connector as dictionaries or lists. By loading many entities in bulk, the Remoting API can load the data way faster, because of optimized database queries and fewer server roundtrips.
In the connector, consider storing these entities in dictionaries rather than in lists. Some customers have enormous numbers of entities that the connector needs to hold temporarily during an export, particularly when publishing a full channel. A dictionary lookup by key takes roughly constant time, whereas searching a list means scanning it, so even for something as trivial as looking up all variants by a product entity ID this can make a big difference.
In general, when working with datasets that large, we really need to think about using the right data structures and how to work with them. It can save a lot of CPU cycles and memory consumption.
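As a rough sketch of the bulk-load-then-index approach: the bulkLoad delegate below stands in for a single bulk call to the Remoting API, and EntityRecord is a simplified, hypothetical entity type, not the API's own Entity class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified entity (illustration only).
public class EntityRecord
{
    public int Id { get; set; }
    public string EntityTypeId { get; set; }
    public int? ParentId { get; set; } // e.g. the product a variant belongs to
}

public class EntityCache
{
    private readonly Dictionary<int, EntityRecord> _byId;
    private readonly ILookup<int, EntityRecord> _childrenByParentId;

    // bulkLoad is assumed to fetch all requested entities in one call.
    public EntityCache(
        IEnumerable<int> ids,
        Func<IReadOnlyCollection<int>, IEnumerable<EntityRecord>> bulkLoad)
    {
        var distinctIds = ids.Distinct().ToList();
        var entities = bulkLoad(distinctIds).ToList(); // one call instead of one per ID

        // Index once, so later lookups do not have to scan a list.
        _byId = entities.ToDictionary(e => e.Id);
        _childrenByParentId = entities
            .Where(e => e.ParentId.HasValue)
            .ToLookup(e => e.ParentId.Value);
    }

    // Returns the entity with the given ID, or null when it was not loaded.
    public EntityRecord Get(int id) => _byId.TryGetValue(id, out var entity) ? entity : null;

    // Returns the children of a parent; empty when there are none.
    public IEnumerable<EntityRecord> ChildrenOf(int parentId) => _childrenByParentId[parentId];
}
```

An ILookup is convenient for the "all variants of a product" case, because asking for a key that does not exist simply yields an empty sequence.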
Adhere to a schema
For transforming the entity data into a data structure and format for the destination system, I suggest that any connector adhere to a schema or data contract.
It could be an XML Schema (XSD), a JSON Schema or something else, as long as it describes the supported data structures. From that schema, I would then autogenerate C# classes, to be able to work with strongly typed objects when transforming data in the connector.
Now, sometimes the destination system cannot provide us with a schema. Instead we get a sample data file. In that case I would derive a schema on my own and generate the classes as described before.
Besides being able to generate classes, a schema also serves as a data contract between different systems. So, if we were provided with a schema from the destination system, we can build a connector and be quite confident that its output will be accepted by that system.
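One way to make use of that contract, for example in a unit test of the connector, is to validate a sample of the output against the XSD. The following is a small sketch using the standard .NET XmlSchemaSet and the XDocument Validate extension method; the SchemaCheck class name is just an illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class SchemaCheck
{
    // Returns the validation messages found; an empty list means
    // the document conforms to the schema.
    public static List<string> Validate(string xml, string xsd)
    {
        var schemas = new XmlSchemaSet();
        using (var schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            // Passing null uses the targetNamespace declared in the schema itself.
            schemas.Add(null, schemaReader);
        }

        var errors = new List<string>();
        XDocument.Parse(xml).Validate(schemas, (sender, e) => errors.Add(e.Message));
        return errors;
    }
}
```

For large exports it is probably enough to validate a representative sample, since loading the whole output into an XDocument would work against the memory considerations discussed in this post.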
Serialize data by enumeration
After having bulk loaded the entities and transformed them to strongly typed data objects, we need to serialize and write those objects in a defined format.
Serializing a list of objects might be considered a trivial task. However, if that list contains more than a few thousand objects, each containing several fields and child objects, then that approach does not scale well.
What happens when serializing a list of thousands of objects with an XmlSerializer into a MemoryStream?
The XmlSerializer will serialize all objects in that list at once, and then write it all to the stream. That approach takes up a lot of memory, since all the objects have to be created and kept in that list before the serialization starts. On top of that, the garbage collector cannot reclaim any of that memory until the serialization is done and the list falls out of scope. This is not optimal.
When I came across that issue, I found that the XmlSerializer behaves differently on objects that implement only IEnumerable. The trick is that the serializer will then enumerate them, serialize and write one item at a time.
A neat thing about this is that the transformation of each item can be postponed until right before it is serialized. To do this, we need to implement iterator methods (returning IEnumerable<T> and using yield statements) for transforming entities into data objects, instead of regular methods that return a complete list of objects. The former will transform and return objects for each iteration only, while the latter will transform everything at once.
A data model might start off with an object that consists of a header element, a list of channel nodes, a list of products and variants and a list of relations between channel nodes and products. Those lists can be really large and contain quite complex objects. By enumerating those lists one after the other, one item at a time, all those data objects will be short-lived and only created when necessary.
All of this will result in less memory allocated, since only the objects that are needed for a particular item are allocated. Those objects can also be reclaimed early by the garbage collector, as soon as the serializer moves on to the next item. This is really important to know, if your connector needs to scale and handle large catalogs.
One thing though. The serializer will not serialize any property of an interface type. So, a very simple wrapper is needed for implementing the IEnumerable interface. It should really just contain an IEnumerable instance and return the enumerator of that instance as its own.
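Here is a minimal sketch of the wrapper and an iterator method working together. All type names (ProductDto, LazyList, CatalogDto, CatalogWriter) are made up for illustration, and the input tuples stand in for loaded entities. Note that the XmlSerializer requires a public Add method on types that implement IEnumerable; it is never called during serialization, so in this sketch it simply throws.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Serialization;

public class ProductDto
{
    [XmlAttribute("id")] public int Id { get; set; }
    public string Name { get; set; }
}

// Thin wrapper: a concrete type the serializer accepts, backed by a lazy sequence.
public class LazyList<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> _source;

    public LazyList() : this(Enumerable.Empty<T>()) { }
    public LazyList(IEnumerable<T> source) { _source = source; }

    public IEnumerator<T> GetEnumerator() => _source.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    // Required by XmlSerializer for IEnumerable types; not used when serializing.
    public void Add(T item) => throw new NotSupportedException("Serialization-only wrapper.");
}

[XmlRoot("Catalog")]
public class CatalogDto
{
    [XmlArray("Products"), XmlArrayItem("Product")]
    public LazyList<ProductDto> Products { get; set; }
}

public static class CatalogWriter
{
    // Iterator method: each DTO is created only when the serializer asks for it.
    public static IEnumerable<ProductDto> Transform(IEnumerable<(int Id, string Name)> entities)
    {
        foreach (var entity in entities)
        {
            yield return new ProductDto { Id = entity.Id, Name = entity.Name };
        }
    }

    public static void Write(IEnumerable<(int Id, string Name)> entities, Stream output)
    {
        var catalog = new CatalogDto
        {
            Products = new LazyList<ProductDto>(Transform(entities))
        };
        new XmlSerializer(typeof(CatalogDto)).Serialize(output, catalog);
    }
}
```

Because Transform is an iterator, no ProductDto exists until the serializer reaches it, and each one becomes eligible for garbage collection once the serializer has moved on to the next item.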
There is no local file system in iPMC
When the data has been transformed and serialized into the output format, the connector needs to write it somewhere.
In previous on-premise versions of InRiver, the connector would typically write the output data to a file on a local filesystem, along with all the resource files (e.g. images and videos) that were referred to in that file. Then it could all be copied (maybe zipped first) to a destination system. However, in iPMC no connector gets any access to read or write files in the local filesystem.
Instead, iPMC connectors need to keep everything in memory and then write the final output to an API or to an external location (for instance, an FTP or SFTP server or a blob storage container). This is why I put so much emphasis on keeping memory usage down above.
So, now that the data output is written to a location, the connector can move on to write all the referenced resource files as well. In previous versions, the connector could write such files to disk as soon as they were seen. In an iPMC connector this does not work.
Instead, the connector could be made to keep track of the IDs of each referenced resource file. Then when the data output has been written, the connector could iterate through that list, download the content of the resource file and write it to the external location. That was what I implemented in my latest connector.
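A rough sketch of that bookkeeping could look like this. The two delegates are hypothetical stand-ins: openRead for downloading the content of a resource file, and write for uploading a stream to the external location. The "resources/{id}" naming is an assumption for the example, not a convention of any particular system.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class ResourceExporter
{
    private readonly HashSet<int> _pending = new HashSet<int>();
    private readonly Func<int, Stream> _openRead;   // hypothetical: downloads a resource file
    private readonly Action<string, Stream> _write; // hypothetical: writes to the external location

    public ResourceExporter(Func<int, Stream> openRead, Action<string, Stream> write)
    {
        _openRead = openRead;
        _write = write;
    }

    // Called during transformation whenever the data refers to a resource file.
    // The HashSet makes sure each file is remembered only once.
    public void Track(int resourceFileId) => _pending.Add(resourceFileId);

    // Called after the data output has been written. Handles one file at a
    // time, so only a single resource is open at any given moment.
    public int Flush()
    {
        var count = 0;
        foreach (var id in _pending)
        {
            using (var content = _openRead(id))
            {
                _write($"resources/{id}", content);
            }
            count++;
        }
        _pending.Clear();
        return count;
    }
}
```

Keeping only the IDs around during the transformation, rather than the file content, fits well with the memory concerns described above.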