After a rather long pause in my posts, I have decided to change the topic of my publications a bit.
Today I want to discuss the topic of "Designing data-intensive applications".
Introduction to data systems
Let's start by understanding what "data-intensive applications" means in our time.
In the modern world, applications are processing more and more data. All developers are trying to design applications that do their job fast. But how do you plan the right architecture for your future application, so that in the end it performs its required functions and is secure and fast? How do you know that your application will keep working well after new features are released, or after the data volume or performance load increases? How do you predict failures in the future system?
I want to review all of these questions in my new "Architecture" topic.
I think we should start with the basic requirements for your system before we get into the technical details of the application.
Each application should be designed around three requirements:
- application should be scalable
- application should be reliable
- application should be maintainable
Let's review them one by one in detail.
Reliability - a term in software development meaning that your system should react correctly to unexpected errors, regardless of the kind of error: software, hardware or even human.
A system is considered reliable if:
- the system works without errors
- the system performs all required functions
- the system anticipates errors of any kind and handles them
- the system is secure and controls all unauthorized requests
- the system has stable performance and does not require additional resources such as memory, disk, etc.
When we review the reliability of a system, we need to understand all possible faults.
Also, please don't confuse faults with failures.
- Fault - when a component of the system has stopped working, but the system as a whole is still up
- Failure - when the system has completely stopped working and all functionality is down
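To make the difference more concrete, here is a minimal Python sketch. The names `Replica`, `ReplicaDown` and `read_with_fallback`, as well as the sample data, are hypothetical and exist only for this illustration. The idea is that a fault in one replica does not have to become a failure as long as another replica can still answer.

```python
class ReplicaDown(Exception):
    """Raised when a single replica (one component) cannot serve a request."""


class Replica:
    """Hypothetical in-memory replica, used only for this illustration."""

    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = data
        self.healthy = healthy

    def read(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_fallback(replicas, key):
    """Try each replica in order; the system fails only if all of them fault."""
    faults = []
    for replica in replicas:
        try:
            return replica.read(key)
        except ReplicaDown as fault:
            faults.append(str(fault))  # a fault: one component is down
    # a failure: no component could serve the request
    raise RuntimeError(f"system failure, all replicas down: {faults}")


primary = Replica("primary", {"user:1": "Alice"}, healthy=False)
backup = Replica("backup", {"user:1": "Alice"})
print(read_with_fallback([primary, backup], "user:1"))  # served by the backup
```

Here the broken primary is a fault, and the request still succeeds. Only when every replica is down does the caller see a failure.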
Hardware fault - the case when the cause of the problem is a hardware issue, such as a disk crash, a power grid blackout or anything similar.
But how can we avoid such issues or minimize their impact? There are a few solutions that help to keep faults from turning into failures.
- Software support - when applications help to preserve data and keep functionality running during hardware faults (e.g. putting disks in RAID)
- Expectation - when we have a visible and up-to-date picture of the hardware state (e.g. monitoring of your system)
- Hardware support - when we add redundant hardware to avoid downtime (e.g. dual power supplies)
- Cloud solution - when we can migrate our systems to the cloud, so that all responsibility for the hardware moves to the provider
- Avoid single-node solutions - when we scale our systems out to several hosts that work in parallel or in backup mode
- Avoid correlated hardware faults - when one hardware fault can cause another one (e.g. when a single cooling system for a whole server rack can stop all hosts)
- Monitoring of MTTF (mean time to failure) - an estimate of when the hardware can be expected to stop working normally
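As a rough illustration of why MTTF matters for planning, the sketch below estimates how many devices in a fleet fail per day on average. The function name and the fleet numbers are hypothetical, and the formula assumes independent failures with a constant failure rate, which is a simplification.

```python
HOURS_PER_DAY = 24


def expected_failures_per_day(device_count, mttf_hours):
    """Average number of devices expected to fail per day.

    Assumes failures are independent and each device fails at a constant
    rate of 1 / MTTF. Real hardware does not behave this neatly.
    """
    if device_count < 0 or mttf_hours <= 0:
        raise ValueError("device_count must be >= 0 and mttf_hours > 0")
    return device_count * HOURS_PER_DAY / mttf_hours


# Hypothetical fleet: 1,000 disks, each with an MTTF of 24,000 hours.
print(expected_failures_per_day(1_000, 24_000))  # 1.0
```

Even with a long MTTF per device, a large enough fleet means that hardware faults become a routine event you have to plan for.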
Software fault - a systematic error, when the system runs into issues with its environment. It can be anything: for example, a Linux kernel issue, a process it depends on slowing down, etc. Unfortunately, there are no quick solutions to these problems, but we can still define possible approaches:
- Testing of the system
- Monitoring of the system
- Analysis of the system and prediction of its behavior
Human fault - an error caused by a human. It can be anything: a wrong configuration or a code issue. This kind of error accounts for about 80% of all faults. That's why it is important to know about them. How can we avoid them? I'm providing just a basic list of solutions, because the list can be individual, depending on the project.
- Design the system around abstractions - if we separate our system into abstractions, we can restrict actions (e.g. a restricted admin panel or API)
- Avoid live human testing - this means that we check our features on a dev or pre-prod environment before releasing them to production
- All levels of testing - covering your system with tests at all levels, from unit to integration testing. Manual testing is also recommended
- Quick recovery - for example, a quick rollback to a backed-up version
- Detailed monitoring - monitoring all metrics, from performance to application stats. The best monitoring is when you see the problem before it reaches a warning state (telemetry monitoring)
- Correct management - when you teach the team to use your application correctly and provide up-to-date documentation
Scalability - a term describing the ability of the system to cope with increased load.
To prepare your system for future load and be able to scale it, make sure that you have completed the following steps:
- Describe load - this means that you should measure the current load on your system. It should include all possible metrics, such as HTTP requests on all your nodes, writes to the database, the hit rate of the cache, etc.
- Describe performance - this means that after taking notes about your load, you should investigate what happens when the load increases. The following things should be checked: 1) What will happen to the system if the load grows but the hardware stays the same; 2) How much time and how many resources you need to spend to keep performance at the same level under increased load.
- Find approaches for coping with the load - in this step you should understand which type of scaling you need to apply, horizontal or vertical.
Also, to better understand your system and investigate its load, set up proper monitoring (SLA, SLO). It will give you a clearer view of future scaling.
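One common way to describe performance, and a typical building block of SLOs, is response-time percentiles. This post does not prescribe a specific metric, so take this as a minimal sketch of my own: it uses the nearest-rank method, and the `percentile` helper and the sample timings are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds for ten requests.
response_times_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]

print("median:", percentile(response_times_ms, 50))
print("p90:", percentile(response_times_ms, 90))
print("max:", percentile(response_times_ms, 100))
```

The median shows what a typical request looks like, while the high percentiles show how bad the slow requests are. Comparing these numbers before and after a load increase is one way to answer the two questions from the "Describe performance" step.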
Maintainability - a term describing the process of supporting your system. It can be anything from this list: fixing bugs, investigating failures, adapting to new platforms or even adding new features.
An interesting fact is that the majority of the cost of software is spent not on development but on its ongoing maintenance. So our main goal is to set up the application in the right way to minimize the pain and cost of future support.
To do this, we need to review three main principles of software development:
1) Operability - make your application easy for the operations team to support
A good way to operate your application is proper automation of your software. Ideally, the operations team should maintain your application only through automation, without manual steps, but all systems have some dark corners which are very hard to predict. So, let's review the main responsibilities of the operations team:
- monitor the health and performance of services
- react quickly to any failures and develop self-healing services for such cases
- investigate the causes of failures
- keep the system up to date (applying patches and updates)
- isolate services from each other to avoid a chain reaction from one issue to another
- predict errors based on monitoring and error audits
- configure CI/CD processes correctly
- manage processes correctly
- keep documentation up to date
- maintain the security of applications
- catch problems in non-production environments to avoid them in production
- share information about the system with all team members
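For the point about isolating services from each other, one well-known pattern is a circuit breaker. It is my choice for illustration and not the only option. The sketch below is a minimal version under simple assumptions: the class names, the thresholds and the injectable `clock` parameter are all hypothetical, and a production setup would normally rely on a tested library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the dependency is considered down."""


class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a while."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable, so the behavior can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency is isolated, call rejected")
            # The cool-down is over: allow one trial call, and reopen
            # immediately if that call fails.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success resets the failure counter
        return result


breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0)
print(breaker.call(lambda: "pong"))  # a healthy dependency passes through
```

After `max_failures` consecutive errors the breaker rejects calls right away instead of waiting on the broken dependency, so one slow or failing service is less likely to drag the others down with it.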
2) Simplicity - make it easy for new engineers to learn your product.
But don't misunderstand what removing complexity means. It doesn't mean that you should avoid features or reduce functionality. That is not the right approach. The right one is to create a pleasant environment for new employees through documentation, guidance, abstractions and so on.
Abstraction is the best way to solve the simplicity issue. It means that we can hide separate parts of the system. Let's review an example. Imagine that we have a high-level programming language which hides machine code: that's one abstraction. The database, which consists of a complex of disk, memory and data structures, is the second abstraction. So we have two unconnected abstractions, and when we are fixing one, we don't have to worry about the other one. It's a very simple example, but it captures the main purpose. If you have an issue in one thing, you know that it will not affect the other one.
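The same idea can be sketched in code. In the example below, the application function depends only on a small interface, and two storage implementations hide their details behind it. All class and function names here are hypothetical and chosen only for this illustration.

```python
from abc import ABC, abstractmethod


class KeyValueStore(ABC):
    """The abstraction: callers see only put/get, never the storage details."""

    @abstractmethod
    def put(self, key, value):
        ...

    @abstractmethod
    def get(self, key):
        ...


class InMemoryStore(KeyValueStore):
    """One hidden implementation: a plain dictionary in memory."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        return self._items.get(key)


class AppendOnlyLogStore(KeyValueStore):
    """Another hidden implementation: an append-only list, newest entry wins."""

    def __init__(self):
        self._log = []

    def put(self, key, value):
        self._log.append((key, value))

    def get(self, key):
        for stored_key, value in reversed(self._log):
            if stored_key == key:
                return value
        return None


def register_user(store, user_id, name):
    """Application code depends only on the abstraction."""
    store.put(f"user:{user_id}", name)
    return store.get(f"user:{user_id}")


for store in (InMemoryStore(), AppendOnlyLogStore()):
    print(type(store).__name__, register_user(store, 1, "Alice"))
```

Because `register_user` never touches the internals, you can fix or replace one storage implementation without worrying about the application code.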
These are the main symptoms of complexity:
- tight coupling between modules
- tangled dependencies
- special-case workarounds in a lot of places
- temporary hacks to solve performance issues
- inconsistent naming or terminology
3) Evolvability - make it easy to change your application for new features or upgrades (also known as "Modifiability").
During development and later modification, it's not possible to keep one fixed list of requirements, because it keeps growing. All changes need attention, whether they come from business priorities, functional features, system growth or architecture changes. Our goal should be to prioritize them correctly. A possible solution here is Agile planning. Thanks to it, we can plan properly and focus on both big changes and small fixes. You will hear more about this soon in the next posts.
That's all for today. Thanks for reading my posts, and I hope this one was as useful to you, my friend, as it was to me.
Please let me know if you're interested in this kind of topic. It will give me more motivation to continue our long journey.