I want to scale out a periodic process that handles huge amounts of data, in line with the amount of data.

Asked 2 years ago, Updated 2 years ago, 98 views

I would like to run a script at relatively short, 5-minute intervals against a WebAPI that returns huge amounts of data. The periodically executed script retrieves data from the WebAPI and then processes it in various ways.

At the moment the data is not that large, so the current setup can handle it, but the volume is expected to grow significantly in the future. As it grows, the processing may no longer finish within 5 minutes.
Therefore, I would like a mechanism that scales out flexibly with the amount of data. Specifically, I want to be able to add more servers running the script as the data grows.

If you know of good implementation methods, tools, architectures, etc. for this situation, please let me know.
Suggestions that rethink the fundamental approach are also welcome.

Thank you for your cooperation.

  • Parameters and features can also be added on the WebAPI side
  • At the moment, we are thinking of adding an API that returns the total record count to the WebAPI, fetching that count, dividing it by the number of batch servers, and having each server handle an equal share... (a sketch of this idea follows below)
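
As a minimal sketch of that idea (the count endpoint /items/count, the "count" response field, and the page size are all assumptions for illustration, not part of the existing API), each batch server could derive its share of pages like this:

import math

import requests

API_BASE = "https://api.example.com"  # placeholder base URL
PER_PAGE = 1000                       # assumed page size

def my_pages(server_index: int, server_count: int) -> range:
    """Pages this server is responsible for, striped across servers."""
    # Hypothetical count API: the question only says one is being considered.
    total = requests.get(f"{API_BASE}/items/count", timeout=10).json()["count"]
    last_page = math.ceil(total / PER_PAGE)
    # Server i takes pages i+1, i+1+server_count, i+1+2*server_count, ...
    # so the shares stay even as the total grows.
    return range(server_index + 1, last_page + 1, server_count)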

Dear Yuki Inoue and Kenji Noguchi, thank you for your comments. Sorry for the lack of information; I will add a little more.

It seems to be assumed that the periodic process you are running is something that can be scaled out. I felt it would be difficult to answer without a definition of what that periodic process actually does.

What the periodic process does is enqueue jobs that make HTTP requests to specified URLs and write the results to a log. Running the jobs themselves is the role of other servers.
The URL information is obtained from the API that returns the huge data set.

Please indicate the format of the input and output data, the amount of data, and what calculations will be made along the way.

■ Input and output
The WebAPI currently has only pagination parameters: the page parameter specifies which page to retrieve, and the per_page parameter specifies how many items each page contains.
The WebAPI response is in JSON format with URL information.

[
  { "id" : 1, "url" : "http://example.com" },
  { "id" : 2, "url" : "http://example.co.jp" },
  ...
]

I own the API myself, so I can extend it.
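
For reference, a minimal sketch of paging through the API described above; the endpoint URL and page size are placeholders:

import requests

API_URL = "https://api.example.com/urls"  # placeholder endpoint
PER_PAGE = 1000                           # assumed page size

def fetch_all_urls():
    """Yield {"id": ..., "url": ...} records page by page."""
    page = 1
    while True:
        resp = requests.get(API_URL,
                            params={"page": page, "per_page": PER_PAGE},
                            timeout=10)
        resp.raise_for_status()
        items = resp.json()
        if not items:  # an empty page means the last page has been passed
            return
        yield from items
        page += 1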

■ Amount of data
The amount of data has not been accurately estimated; I only know that it will keep increasing, and I feel the current approach will collapse under it, which is why I asked this question.

■ Processing along the way
The batch script simply enqueues jobs that make HTTP requests (one possible shape is sketched below).
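
As one concrete illustration (not the questioner's actual code), a sketch using the open-source RQ library with Redis as the queue; check_url is a hypothetical job function that the worker servers execute, and fetch_all_urls is the pager sketched earlier:

import logging

import requests
from redis import Redis
from rq import Queue

# Assumed importable module holding the pager sketched earlier; in real use,
# check_url below must also live in a module importable by the workers.
from fetcher import fetch_all_urls

def check_url(record: dict) -> None:
    """Job body run on a worker: request the URL and log the outcome."""
    try:
        resp = requests.get(record["url"], timeout=10)
        logging.info("id=%s url=%s status=%s",
                     record["id"], record["url"], resp.status_code)
    except requests.RequestException as exc:
        logging.warning("id=%s url=%s error=%s",
                        record["id"], record["url"], exc)

def enqueue_jobs() -> None:
    """Periodic batch entry point: one queued job per URL record."""
    queue = Queue("urlcheck", connection=Redis())
    for record in fetch_all_urls():
        queue.enqueue(check_url, record)

Each worker server would then run a process such as rq worker urlcheck, so scaling out becomes a matter of starting more worker processes.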

api batch-file

2022-09-30 21:11

2 Answers

If this is a request to list products and tools as answers, it is an undesirable type of question, because one can think of any number of answers.

There are many commercial and open-source products for job management and batch management, and most modern languages have libraries that implement a job queue. An ad hoc script can also work, in its own way.

There are many choices, so please search by the keywords mentioned above.

If you are trying out a tool or writing your own script and run into trouble at that stage, you are more likely to get answers by clarifying that point in your question.

If you have no idea how to approach it at all, I recommend consulting a system integrator (SIer) that has such a product or offers solutions built on open-source software.

Now that the comments have clarified that this is service monitoring, I will add to my answer accordingly.

For service monitoring, the first choice is to use existing system monitoring products and services such as Nagios and Zabbix, but if you need to handle 100 million dynamically acquired targets, you will have to build something yourself.

If you ignore the scale, it is like a practice exercise for a job queue library.

But at a scale of 100 million, the back-of-the-envelope numbers look like this (the arithmetic is checked in the sketch after the list):

  • At 30 bytes per URL, the master list alone is 3 GB, and just creating the jobs takes time, so the master data must also be distributed (three layers: splitting the data, workers registering jobs, workers actually processing them)
  • Finishing within 5 minutes means roughly 330,000 requests per second; even at an optimistic 100 requests per second per worker, that is over 3,000 workers
  • With one job per URL you get 100 million jobs; if you bundle multiple URLs per job, the size of each job grows instead. Either way, the performance of whatever actually backs the queue, such as the DB, needs attention
  • Retrieving 5 KB of data per URL requires about 13 Gbps of throughput
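
These figures follow from straightforward arithmetic; a quick check of the bullets above, with all constants taken from the answer:

N_URLS = 100_000_000   # 100 million monitoring targets
INTERVAL_S = 5 * 60    # one 5-minute cycle in seconds

print(N_URLS * 30 / 1e9)                      # ~3.0 GB of master data at 30 bytes per URL
print(N_URLS / INTERVAL_S)                    # ~333,333 requests per second
print(N_URLS / INTERVAL_S / 100)              # ~3,333 workers at 100 requests/s each
print(N_URLS * 5_000 * 8 / INTERVAL_S / 1e9)  # ~13.3 Gbps at 5 KB per URL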

The hurdle is quite high. In reality it is normal for some targets to be dead or slow to respond, so even a tenth of this scale would still be quite difficult.

In any case, the entire system, including the network, has to be designed for this; ad hoc scale-out is out of the question both technically and cost-wise. If the monitoring targets are under your own control, you should reconsider the approach.


2022-09-30 21:11

Would it really be the case that all 100 million targets are up 24 hours a day? That would be a remarkable service; perhaps only a fraction of them are running at any given time?

When there are a large number of monitoring targets, it is often more convenient to have the monitored side (acting as a client) send liveness information to the server, instead of the server polling each target. This is called a client beacon.

A client beacon can work as simply as periodically sending an HTTP GET to a web server and aggregating the access logs. In the questioner's case, one beacon per 5 minutes at most would suffice. By embedding the client's status in the query string, you can monitor it.

There are often two types of beacons: event beacons, which report power on/off and various operations, and interval beacons, which report statistics such as usage time. If nothing has happened for 5 minutes, send an interval beacon.
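
A minimal sketch of such a client-side beacon; the endpoint URL and the query-string fields are placeholders for illustration, not something defined in this answer:

import time
import uuid

import requests

BEACON_URL = "https://beacon.example.com/b"  # placeholder aggregation endpoint
CLIENT_ID = str(uuid.uuid4())                # per-client identifier
INTERVAL_S = 5 * 60
START = time.monotonic()

def send_beacon(kind: str, **status) -> None:
    """Fire one GET; the web server's access log is the only storage needed."""
    params = {"client": CLIENT_ID, "type": kind, **status}
    try:
        requests.get(BEACON_URL, params=params, timeout=5)
    except requests.RequestException:
        pass  # beacons are best-effort; the next one covers the gap

send_beacon("event", op="power_on")  # event beacon on startup
while True:                          # interval beacon every 5 minutes
    time.sleep(INTERVAL_S)
    send_beacon("interval", uptime_s=int(time.monotonic() - START))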

Capacity planning for the web server is the same as for an ordinary web service. That is, you can size the fleet by estimating how many of the 100 million clients are on at the same time: if 1 million clients are on simultaneously and one server can handle 100,000 clients per 5 minutes, you need 10 servers.

From there, if you aggregate the servers' access logs and derive each client's status from the raw logs, you will meet the questioner's requirements.
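
A sketch of that aggregation step, assuming a combined-format access log and the client/type query parameters from the beacon sketch above; a client whose newest beacon is older than two intervals is treated as down:

import re
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

# Matches e.g.: ... [30/Sep/2022:21:11:00 +0900] "GET /b?client=abc&type=interval HTTP/1.1" ...
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "GET (?P<path>/b\?\S+) HTTP')
TS_FMT = "%d/%b/%Y:%H:%M:%S %z"
INTERVAL = timedelta(minutes=5)

def last_seen(log_path: str) -> dict[str, datetime]:
    """Scan the raw access log, keeping each client's newest beacon time."""
    seen: dict[str, datetime] = {}
    with open(log_path) as log:
        for line in log:
            m = LINE_RE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group("ts"), TS_FMT)
            client = parse_qs(urlparse(m.group("path")).query).get("client", ["?"])[0]
            if client not in seen or ts > seen[client]:
                seen[client] = ts
    return seen

def down_clients(log_path: str) -> list[str]:
    """Clients not heard from for two intervals are reported as down."""
    cutoff = datetime.now(timezone.utc) - 2 * INTERVAL
    return [c for c, ts in last_seen(log_path).items() if ts < cutoff]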

Snakefoot

It has nothing to do with the questioner's requirements, but if the client is a mobile app, you can collect the access logs and aggregate them on a Spark/Hadoop cluster to extract a lot of valuable information: beacon types and values, handset model, carrier, IP address, statistics by time of use, and so on.


2022-09-30 21:11
