Wednesday, February 15, 2023

Text analytics on AWS: implementing a data lake architecture with OpenSearch (Part 1)

Text data is a common type of unstructured data found in analytics. It is often stored without a predefined format and can be hard to obtain and process.

Web pages, for example, contain text data that data analysts collect through web scraping and pre-process using techniques such as lowercasing, stemming, and lemmatization. After pre-processing, data scientists and analysts analyze the cleaned text to extract relevant insights.
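To make the pre-processing step concrete, here is a minimal, self-contained sketch in plain Python. The naive suffix-stripping "stemmer" is a toy stand-in; a real pipeline would use a library such as NLTK or spaCy for proper stemming and lemmatization.

```python
import re

def preprocess(text):
    """Toy pre-processing: lowercase, strip punctuation via tokenization,
    and apply a naive suffix-stripping stemmer (illustration only)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = []
    for tok in tokens:
        # Crude stemming: strip one common English suffix, if present.
        for suffix in ("ing", "ed", "es", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed

print(preprocess("The Crawlers collected pages, scraping texts."))
# → ['the', 'crawler', 'collect', 'pag', 'scrap', 'text']
```

Note how "pages" becomes "pag": crude rules over-stem, which is why production pipelines reach for a real stemmer or lemmatizer.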

We can handle text data at scale using a data lake architecture on Amazon Web Services (AWS).

Below is a reference architecture from the AWS Architecture Blog for building an end-to-end text analytics solution, from data collection and ingestion through to data consumption in OpenSearch.


The detailed steps are as follows:

1. Collect data from various sources, such as SaaS applications, edge devices, logs, streaming media, and social networks.
2. Use tools like AWS Database Migration Service (AWS DMS), AWS DataSync, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK), AWS IoT Core, and Amazon AppFlow to ingest the data into the AWS data lake, depending on the data source type.
3. Store the ingested data in the raw zone of the Amazon Simple Storage Service (Amazon S3) data lake—a temporary area where data is kept in its original form.
4. Validate, clean, normalize, transform, and enrich the data through a series of pre-processing steps using AWS Glue or Amazon EMR.
5. Place the data that is ready to be indexed in the indexing zone.
6. Use AWS Lambda to index the documents into OpenSearch and store them back in the data lake with a unique identifier.
7. Use the clean zone as the source of truth for teams to consume the data and calculate additional metrics.
8. Develop, train, and generate new metrics using machine learning (ML) models with Amazon SageMaker or artificial intelligence (AI) services like Amazon Comprehend.
9. Store the new metrics in the enriching zone along with the identifier of the OpenSearch document.
10. Use the identifier column from the initial indexing phase to identify the correct documents and update them in OpenSearch with the newly calculated metrics using AWS Lambda.
11. Use OpenSearch to search through the documents and visualize them with metrics using OpenSearch Dashboards.
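Steps 5 and 6 hinge on giving every document a unique identifier at indexing time. The sketch below shows one way a Lambda function might prepare an OpenSearch `_bulk` request body, deriving a deterministic ID from each document's text; the index name `articles` and the `text` field are assumptions for illustration, not part of the AWS reference architecture. The function would then POST this body to the cluster's `/_bulk` endpoint and write the IDs back to the data lake.

```python
import hashlib
import json

def build_bulk_index_body(documents, index_name="articles"):
    """Build an NDJSON body for the OpenSearch _bulk API, assigning each
    document a deterministic ID (SHA-256 of its text) so the same ID can
    be stored back in the data lake alongside the document."""
    lines = []
    for doc in documents:
        doc_id = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        # Action line, then the document source, per the _bulk format.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline
```

Hashing the text (rather than generating a random UUID) makes re-runs idempotent: re-indexing the same document overwrites it instead of creating a duplicate.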
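Steps 9 and 10 close the loop: the stored identifier lets us attach newly computed metrics to the right OpenSearch documents without re-indexing them. Here is a minimal sketch of building a `_bulk` partial-update body from rows in the enriching zone; the `doc_id` column name and the sentiment metric are assumptions for illustration.

```python
import json

def build_bulk_update_body(metrics_rows, index_name="articles"):
    """Build an OpenSearch _bulk partial-update body. Each row is assumed
    to carry the `doc_id` saved at indexing time plus newly computed
    metrics (for example, a sentiment score from Amazon Comprehend)."""
    lines = []
    for row in metrics_rows:
        row = dict(row)  # copy so the caller's row is not mutated
        doc_id = row.pop("doc_id")
        # An `update` action with a partial `doc` merges the new fields
        # into the existing document instead of replacing it.
        lines.append(json.dumps({"update": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({"doc": row}))
    return "\n".join(lines) + "\n"
```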

References:
https://aws.amazon.com/blogs/architecture/text-analytics-on-aws-implementing-a-data-lake-architecture-with-opensearch/



