Data Collection and Migration with MDACA Data Flow


  • Author Anthony Cook
  • Published November 28, 2022

Collecting and moving large volumes of data can be challenging in any big data ecosystem. When the data forms a critical part of logistical planning and analysis, having the right data available in a timely manner can mean the difference between a successful campaign and a failed one. Tracking and reporting the immunizations of our military members is one such case, particularly since it involves sensitive health workload data. Accurate and effective immunization tracking and reporting is increasingly important given the recent rise in illnesses and viral strains that cause epidemics and pandemics. Ensuring that service members facing deployment across the globe are protected from these and other infectious diseases not only helps the military maintain operating efficiency but also minimizes the risk of its members becoming carriers of disease both at home and abroad.

It recently became necessary for the Defense Health Agency (DHA) to quickly replace the legacy, disparate immunization tracking and reporting systems for military members and their families with one that is modernized, centralized, and uniformly accessible to all branches. To facilitate that effort, we leveraged the Multiplatform Data Acquisition, Collection, and Analytics (MDACA) Data Flow (“Data Flow”) running on Amazon Web Services (AWS) GovCloud to manage the collection, migration, and centralization of immunization records from all military branches into shared data repositories and enterprise information systems. As depicted in Figure 1, Data Flow is a directed-graph engine with hundreds of ready-made components for moving data between systems using the most common protocols, schemas, and data formats. MDACA Data Flow provided us with a building-block approach that enabled us to model and deploy a working system in a fraction of the time it would have taken to design and code the needed capabilities from scratch.

The first stage of the project required that the new solution’s interfaces be functionally identical to those of the legacy system, so that client applications could continue tracking and reporting without any changes. This made it necessary to continue collecting the data through a variety of legacy communication protocols and messaging schemas while ensuring delivery to the modernized back end. The requirements included:

Supporting ingestion of raw immunization and personnel records sent in a specialized subset of Health Level 7 (HL7) and proprietary fixed-length messaging schemas.

Receiving immunization records both singly and in batches containing thousands of records via streamed and flat-file based delivery through HTTP(S), SFTP, and Amazon S3 protocols.

Converting the raw HL7 and proprietary formatted messages to JSON and Apache Parquet data formats for ETL pipelines feeding the data into back-end databases.

ETL pipelines communicating with back-end databases directly through SQL.
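
To make the fixed-length requirement concrete, the following is a minimal Python sketch of slicing such a record into named fields. The article does not describe the proprietary schema, so the field names, offsets, and lengths in LAYOUT are purely hypothetical, and the function is an illustration rather than part of Data Flow.

```python
# Hypothetical layout: (field name, start offset, length).
# The real proprietary schema is not described in the article.
LAYOUT = [
    ("patient_id", 0, 10),
    ("vaccine_code", 10, 5),
    ("admin_date", 15, 8),
]


def parse_fixed_length(record, layout=LAYOUT):
    """Slice a fixed-length record into a dict of trimmed field values."""
    expected = max(start + length for _, start, length in layout)
    if len(record) < expected:
        raise ValueError(
            "record is %d characters, expected at least %d" % (len(record), expected)
        )
    return {
        name: record[start:start + length].strip()
        for name, start, length in layout
    }


# Build a sample record by padding each value to its assumed width.
sample = "12345".ljust(10) + "XX".ljust(5) + "20221120"
fields = parse_fixed_length(sample)
```

Checking the record length up front means a truncated message fails loudly instead of silently yielding short or empty fields.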

Already used to ingest nearly 9 billion transactions for DHA daily (Figure 2), MDACA Data Flow was an existing, accredited, and natural fit for these requirements. Using its drag-and-drop tools and large component library, we quickly assembled pipelines to handle three principal data movement and conversion tasks:

Real-time ingestion of immunization records from client sites in either HL7 or proprietary message formats. The records needed to be received via HTTP in their raw format and pushed to S3 in Parquet format with no loss of information.

We created one pipeline to receive the raw HL7 record over HTTP, convert it to JSON and Parquet, upload it to S3 for movement to the SQL back end by extract-transform-load (ETL) pipelines, and return an HTTP status indicating a successful upload or an error. Thanks to Data Flow’s ready support for HL7, no coding, scripting, or use of third-party libraries was necessary to parse and convert the data from HL7.

For the proprietary format, we created a similar pipeline that receives the data already converted to JSON format from our client-facing web services.
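
Although Data Flow performed the HL7 conversion with built-in components, it can help to see roughly what such a conversion involves. The sketch below is a simplified, standard-library illustration and not Data Flow’s implementation: it assumes the default HL7 v2 delimiters (“|” between fields, “^” between components), ignores repetitions, subcomponents, and escape sequences, and uses a made-up sample message with placeholder values.

```python
import json


def hl7_to_dict(message):
    """Convert a pipe-delimited HL7 v2 message into a dict keyed by segment ID."""
    segments = {}
    # HL7 v2 terminates segments with carriage returns; tolerate newlines too.
    for line in message.replace("\n", "\r").split("\r"):
        if not line.strip():
            continue
        fields = line.split("|")
        seg_id = fields[0]
        parsed = []
        for index, field in enumerate(fields[1:]):
            if seg_id == "MSH" and index == 0:
                # The first MSH field holds the encoding characters (^~\&);
                # keep it verbatim rather than splitting it on '^'.
                parsed.append(field)
            elif "^" in field:
                parsed.append(field.split("^"))
            else:
                parsed.append(field)
        # A segment type can repeat, so store each occurrence in a list.
        segments.setdefault(seg_id, []).append(parsed)
    return segments


# Made-up sample message; the identifiers and codes are placeholders.
SAMPLE = (
    "MSH|^~\\&|CLINIC|SITE1|IMMS|DHA|20221128||VXU^V04|MSG0001|P|2.5.1\r"
    "PID|1||12345^^^MR||DOE^JANE\r"
    "RXA|0|1|20221120||XX^Example vaccine^CVX"
)

record = hl7_to_dict(SAMPLE)
as_json = json.dumps(record)
```

Keeping every field, including empty ones, preserves field positions, which is one simple way to avoid losing information during conversion.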

Periodic ingestion of immunization record batches in HL7 format from flat files containing thousands of records. The batch files needed to be received via SFTP and moved to S3. Though Data Flow has components for working with SFTP servers, administrative and security requirements at the location made it necessary for another process tied to the SFTP server to move the batch files to S3.

Our solution included parallel pipelines that periodically scan S3 for new batch files, download and archive them, split the batches into individual records, and feed those records through a conversion and routing process similar to the one used for the real-time stream. However, before converting the records, the pipeline had to send them via HTTP to our middle-tier web services to verify and validate their structure and content. This was easily achievable through Data Flow’s ready support for the S3 and HTTP protocols, with rapid processing and scaling on AWS GovCloud.
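
The batch-splitting step can be sketched in a few lines of Python. This is an illustration only, not the Data Flow component: it assumes the batch follows the usual HL7 v2 batch convention, in which each message begins with an MSH segment and the file may be wrapped in FHS/BHS header and BTS/FTS trailer segments. The specialized HL7 subset used in the project may differ.

```python
# Segments that wrap a batch rather than belonging to any one message.
BATCH_ENVELOPE = ("FHS", "BHS", "BTS", "FTS")


def split_batch(batch_text):
    """Split an HL7 batch file into a list of individual messages."""
    messages = []
    current = []
    normalized = batch_text.replace("\r\n", "\r").replace("\n", "\r")
    for segment in normalized.split("\r"):
        if not segment.strip():
            continue
        seg_id = segment[:3]
        if seg_id in BATCH_ENVELOPE:
            continue  # drop batch headers and trailers
        if seg_id == "MSH" and current:
            # A new MSH segment closes the message collected so far.
            messages.append("\r".join(current))
            current = []
        current.append(segment)
    if current:
        messages.append("\r".join(current))
    return messages


# Made-up two-message batch with placeholder content.
batch = (
    "BHS|^~\\&|CLINIC\r"
    "MSH|^~\\&|CLINIC|SITE1|IMMS|DHA|20221128||VXU^V04|MSG0001|P|2.5.1\r"
    "PID|1||12345\r"
    "MSH|^~\\&|CLINIC|SITE1|IMMS|DHA|20221128||VXU^V04|MSG0002|P|2.5.1\r"
    "PID|1||67890\r"
    "BTS|2"
)
individual = split_batch(batch)
```

Each element of the result is a complete message that can then be validated and converted on its own, in the same way as a record from the real-time stream.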

Instead of HTTP status codes, the batch clients receive HL7-formatted acknowledgement and validation error records, returned via S3 and SFTP in batch files that match the original input batches in size. For this, MDACA Data Flow’s HTTP support is again used to generate the response records, while its data merging components assemble them into the final response batch files routed back to the clients.

The only thing in these processes that Data Flow doesn’t already do is parse the web service responses for validation results and extract the generated response records. However, Data Flow includes components allowing these actions to be easily performed in a variety of scripting languages including Python, JavaScript, Lua, and others, as well as through external processes on the host system. We opted to write a few in-line Jython scripts to handle these tasks for the speed, ease, and portability offered by Python.
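
The article does not show those scripts or the shape of the web service responses, so the following is only a sketch of the kind of parsing involved. It assumes a JSON response body with hypothetical keys "valid", "errors", and "responseRecord", and it is written as a plain function rather than against Data Flow’s scripting interface. It uses only the standard json module.

```python
import json


def parse_validation_response(body):
    """Extract the validation outcome and generated response record.

    The keys "valid", "errors", and "responseRecord" are hypothetical;
    the real web service schema is not described in the article.
    """
    payload = json.loads(body)
    is_valid = bool(payload.get("valid", False))
    errors = list(payload.get("errors", []))
    response_record = payload.get("responseRecord", "")
    return is_valid, errors, response_record


# Made-up response body for a rejected record.
example_body = json.dumps({
    "valid": False,
    "errors": ["PID-3 missing patient identifier"],
    "responseRecord": "MSH|^~\\&|IMMS|DHA|CLINIC|SITE1|20221128||ACK|MSG0001|P|2.5.1",
})
ok, problems, ack = parse_validation_response(example_body)
```

Treating a missing "valid" key as a failure is a deliberate choice in this sketch: a malformed response should not let a record through unvalidated.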

Near-real-time updating of the back-end databases from the Parquet-formatted immunization data in S3. While we could have easily inserted the data directly into the back end instead of dropping the converted records into S3, it was necessary to preserve the original records in object form for archival and auditing purposes. Therefore, we used an S3 data lake to store these objects and created several ETL pipelines to replicate their information into the back-end SQL tables.

Using Data Flow’s scheduling support and components for working with SQL, the ETL pipelines regularly scan the data lake, using MDACA Big Data Virtualization (BDV) to query for new rows in BDV’s meta-stores. These queries return Apache Avro-formatted results, which the ETL pipelines convert to SQL insert statements and execute against the back-end tables.
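
The last step, turning result rows into inserts, can be sketched as follows. This is an illustration and not the Data Flow pipeline: it assumes the Avro results have already been decoded into Python dicts, uses an in-memory SQLite database as a stand-in for the back-end SQL tables, and invents the table and column names. It also builds parameterized statements rather than literal SQL text, a substitution made here for safety.

```python
import sqlite3


def build_insert(table, row):
    """Build a parameterized INSERT statement for one decoded row."""
    columns = sorted(row)
    # Identifiers cannot be bound as parameters, so validate them instead.
    for name in [table] + columns:
        if not name.isidentifier():
            raise ValueError("unsafe SQL identifier: %r" % (name,))
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), ", ".join("?" for _ in columns)
    )
    return sql, [row[column] for column in columns]


def load_rows(conn, table, rows):
    """Insert every row and return how many were written."""
    count = 0
    for row in rows:
        sql, params = build_insert(table, row)
        conn.execute(sql, params)
        count += 1
    conn.commit()
    return count


# Stand-in back end with a hypothetical table layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE immunization (admin_date TEXT, patient_id TEXT, vaccine_code TEXT)"
)
new_rows = [
    {"patient_id": "12345", "vaccine_code": "XX", "admin_date": "20221120"},
    {"patient_id": "67890", "vaccine_code": "XX", "admin_date": "20221121"},
]
inserted = load_rows(conn, "immunization", new_rows)
```

The “?” placeholder is SQLite’s parameter style; other database drivers use different markers, so the statement template would need adjusting for a different back end.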

Although we were working under an accelerated timetable, leveraging MDACA Data Flow in these ways enabled us to deliver on DHA’s requirements quickly and efficiently within the accredited environment. Updating and modernizing the military’s immunization tracking capabilities could have required months of design and coding work under a traditional development model. With MDACA Data Flow, combined with our expertise in data ingestion and solution integration with AWS technologies, we accomplished it in a matter of weeks.

Senior Technical Architect at MDACA
