Why save data to a streaming server such as Amazon Kinesis or Apache Stream before saving it to a DB?

When it comes to large-scale data collection, stream processing often comes up, along with topics such as Amazon Kinesis and Apache Stream.

Multiple Servers -> Amazon Kinesis -> Amazon Redshift

and so on. Is there a transactional problem with saving directly to the DB from multiple servers, like this?

Multiple Servers -> Amazon Redshift

I understand that stream processing is used because the processing cannot keep up otherwise.

However, if you put a streaming server in between, it seems to me that you are just passing the data along from left to right, and in the end the processing still won't be able to keep up.

In practice that is evidently not the case, and the streaming server seems to handle it well. I have been looking into this to learn how it works, but I couldn't find any good materials.

*One thing that comes to mind is that the streaming server accumulates data up to a point and writes it to the DB once more than a certain amount has been stored.

I would appreciate it if you could let me know based on your experience.

Thank you for your cooperation.

aws

2022-09-30 19:37

3 Answers

If a large number of data sources access the datastore directly, there are many problems, such as:

It is very difficult to make changes to the datastore
Performance depends on the data sources and cannot be managed
Writing small amounts of data many times incurs too much overhead
Processing on the data source side becomes bloated
When data from multiple sources must be combined, the datastore side also has to do that work

If there is something between the data sources and the datastore that acts as a buffer, these problems can be resolved easily.

Only the buffer needs to know where the datastore is, so you only have to change the buffer
Performance can be managed by controlling the output from the buffer
The buffer can use each datastore's low-overhead data loading method
The data source side can focus solely on sending data to the buffer
Intermediate processing such as data transformation and aggregation can be done in the buffer

Performance is important, but it is not the only reason.
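To make the "many small writes" point concrete, here is a minimal sketch of a buffer that collects records from many servers and hands them to the datastore in batches. BatchingBuffer, fake_bulk_insert, and the flush size of 100 are names and values made up for illustration; this is not how Kinesis is implemented internally.

```python
from typing import Callable, List


class BatchingBuffer:
    """Hypothetical buffer: collects records and flushes them in batches."""

    def __init__(self, flush_size: int, sink: Callable[[List[dict]], None]) -> None:
        self.flush_size = flush_size
        self.sink = sink  # only the buffer knows how to reach the datastore
        self.pending: List[dict] = []

    def put(self, record: dict) -> None:
        # Data sources only call put(); they never talk to the datastore.
        self.pending.append(record)
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        # One bulk write instead of many single-row writes.
        if self.pending:
            self.sink(self.pending)
            self.pending = []


def demo_batching() -> List[int]:
    batch_sizes: List[int] = []

    def fake_bulk_insert(rows: List[dict]) -> None:
        # Stand-in for a bulk load into the datastore (assumption).
        batch_sizes.append(len(rows))

    buf = BatchingBuffer(flush_size=100, sink=fake_bulk_insert)
    for i in range(250):
        buf.put({"server": i % 5, "value": i})
    buf.flush()  # write whatever is left over
    return batch_sizes


print(demo_batching())  # [100, 100, 50]
```

With 250 records from five pretend servers, the datastore sees three bulk writes instead of 250 individual ones, and swapping the datastore only means changing the sink.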

2022-09-30 19:37

Learn what Amazon Kinesis is from examples

The article linked above summarizes the advantages of Amazon Kinesis concisely, so I recommend referring to it.
The article was written a little over a year ago, so some of the information may be slightly out of date.

2022-09-30 19:37

Here are the reasons that come to mind:

In asynchronous processing, when the throughput of the input data and the processing capacity of the output destination do not match, I think it is common to store the data somewhere temporarily (queueing, caching, etc.).
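As a rough illustration of that mismatch, the following sketch simulates a queue sitting between bursty input and an output side with fixed capacity. The function name simulate and all the numbers are made up for the example.

```python
from collections import deque
from typing import List, Tuple


def simulate(arrivals: List[int], capacity_per_tick: int) -> Tuple[int, int, int]:
    """Return (processed, max_backlog, final_backlog) for a simple queue model."""
    queue: deque = deque()
    processed = 0
    max_backlog = 0
    for n in arrivals:
        queue.extend(range(n))  # n records arrive in this tick
        # The output side handles at most capacity_per_tick records per tick.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
            processed += 1
        max_backlog = max(max_backlog, len(queue))
    return processed, max_backlog, len(queue)


# Bursty input whose average is below capacity: the queue absorbs the spikes.
print(simulate([50, 0, 0, 30, 0, 0], capacity_per_tick=20))  # (80, 30, 0)

# Sustained input above capacity: the backlog keeps growing.
print(simulate([30, 30, 30, 30], capacity_per_tick=20))  # (80, 40, 40)
```

The first case is where a queue helps. The second case matches the concern in the question: if the input stays above what the output can handle, a queue only delays the problem.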

To preserve the order of the data and to allow large amounts of input data to be batch-processed in parallel, I think it is stored temporarily.

Retrieving the same data repeatedly may also be difficult with a mechanism that merely forwards data like a proxy.
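The difference from a simple forwarding proxy can be sketched with an append-only log, where each reader keeps its own position. AppendOnlyLog and the reader names below are hypothetical and only illustrate the idea; they are not the Kinesis API.

```python
from typing import Any, List, Tuple


class AppendOnlyLog:
    """Hypothetical stored stream: records stay readable after being read."""

    def __init__(self) -> None:
        self._records: List[Any] = []

    def append(self, record: Any) -> int:
        self._records.append(record)
        return len(self._records) - 1  # position of the new record

    def read_from(self, position: int, limit: int = 100) -> List[Any]:
        # Reading does not remove anything, so any reader can re-read.
        return self._records[position:position + limit]


def demo_replay() -> Tuple[List[str], List[str], List[str]]:
    log = AppendOnlyLog()
    for event in ["e0", "e1", "e2", "e3"]:
        log.append(event)
    loader_first = log.read_from(0)  # e.g. a job loading into the DB
    analytics = log.read_from(0)     # a second, independent reader
    loader_retry = log.read_from(2)  # the loader re-reads after a failure
    return loader_first, analytics, loader_retry


print(demo_replay())
```

A proxy that forwards each record once cannot serve the second reader or the retry, because it no longer has the data.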

Since resources are finite, I think there are times when the output side cannot keep up. Kinesis also keeps data that has been put to it for only 24 hours (by default), so if the output side has not retrieved and processed the data by then, it is lost.
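The retention limit can be pictured with a small sketch like the one below. RetainedStream is a made-up class that uses explicit timestamps in seconds; the 24-hour value comes from the text above, and everything else is an assumption for illustration.

```python
from typing import List, Tuple

HOUR = 3600


class RetainedStream:
    """Hypothetical stream that drops records older than the retention period."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self.records: List[Tuple[float, str]] = []

    def put(self, now: float, data: str) -> None:
        self.records.append((now, data))

    def read_all(self, now: float) -> List[str]:
        # Anything put at or before the cutoff has expired and is gone.
        cutoff = now - self.retention_seconds
        self.records = [(t, d) for t, d in self.records if t > cutoff]
        return [d for _, d in self.records]


stream = RetainedStream(retention_seconds=24 * HOUR)
stream.put(0, "a")
stream.put(20 * HOUR, "b")
print(stream.read_all(23 * HOUR))  # ['a', 'b']
print(stream.read_all(25 * HOUR))  # ['b']
```

If the consumer first reads at hour 25, record "a" has already expired, which is the "you will miss it" case described above.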

Also, Kinesis throughput itself has no upper limit, but by default it scales only up to 10 shards. If the throughput is insufficient, the application has to work around it.
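For a rough idea of how many shards a workload needs, here is a back-of-the-envelope sketch. It assumes the per-shard write limits of 1 MB/s and 1,000 records/s as I understand them from the AWS documentation, and takes the default of 10 shards from the text above; please check the current documentation, since these values can change. shards_needed is a made-up helper.

```python
import math

# Assumed per-shard write limits (verify against the current AWS docs).
WRITE_MB_PER_SEC_PER_SHARD = 1.0
WRITE_RECORDS_PER_SEC_PER_SHARD = 1000
DEFAULT_SHARD_LIMIT = 10  # default mentioned in the answer above


def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Estimate the shard count from the stricter of the two write limits."""
    mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_bytes = math.ceil(mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC_PER_SHARD)
    return max(1, by_bytes, by_records)


for rate, size_kb in [(5000, 2), (20000, 0.5), (12000, 2)]:
    n = shards_needed(rate, size_kb)
    status = "within" if n <= DEFAULT_SHARD_LIMIT else "over"
    print(f"{rate} records/s at {size_kb} KB -> {n} shards ({status} the default limit)")
```

The second and third workloads come out above 10 shards, so under these assumptions you would either need a higher limit or have the application reduce the load, for example by aggregating records before sending.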

Please refer to the FAQ:
https://aws.amazon.com/jp/kinesis/streams/faqs/

As for the role of stream processing, I think it is to process large amounts of input data in real time and output it to a data warehouse. The Spark Streaming materials on the Apache Spark site were easy to understand as an overview.

http://spark.apache.org/talks/
http://spark.apache.org/talks/strata_spark_streaming.pdf

According to this document, a streaming framework has the following five requirements, and I think the role of stream processing is to fulfill them:

Scalable to large clusters
Achieves second-scale latencies
Has a simple programming model
Integrates with batch & interactive workloads
Ensures efficient fault-tolerance in stateful computations

2022-09-30 19:37
