A single server can only process a finite amount of data. Once all* of its capacity is used, performance degrades until the server crashes or grinds to a halt. Once you hit your current hardware’s limits, you have two options:
Vertical Scaling: This means adding more resources to the server: more CPUs, more memory, and so on. In the pre-cloud days servers ran on specialized hardware that allowed memory and CPUs to be added or removed while the server was running. Nowadays you just change a setting on your VM.
Horizontal Scaling: Instead of beefing up the existing hardware, you run more of the same. Instead of all requests landing on one server, you distribute the load across multiple servers that do the same thing.
Both methods have their pros and cons. Vertical scaling is easy because it doesn’t require any changes to the programs you are running, but there is a limit to how many CPUs or how much memory you can add to a single server.
Horizontal scaling is “unlimited” - you can always add another server. However, your application must be able to run as multiple instances at the same time without, for example, corrupting data. The management overhead is also bigger: you need some sort of coordination layer that distributes requests across your fleet of servers, and so on.
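The simplest form of that coordination layer is a round-robin distributor that hands each incoming request to the next server in the fleet. Here is a minimal sketch in Python; the server names and the `route` function are made up for illustration:

```python
from itertools import cycle

# Hypothetical fleet of identical servers.
servers = ["app-1", "app-2", "app-3"]

# cycle() endlessly repeats the list, giving us round-robin order.
next_server = cycle(servers)

def route(request_id: int) -> str:
    """Pick the server that should handle this request."""
    return next(next_server)

# Ten requests get spread evenly over the three servers.
assignments = [route(i) for i in range(10)]
print(assignments)
```

Real load balancers (nginx, HAProxy, cloud load balancers) add health checks, weighting, and connection draining on top of this basic idea, but the core distribution logic is the same.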
With the ubiquity of the cloud, horizontal scaling has become the de facto standard for scaling a service, and it is one of the reasons for the triumph of microservices.
* Computers behave similarly to roads: a server cannot be utilized at 100% and still be fast. As a rule of thumb, a server should not be utilized above 80% to avoid resource congestion.
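To make the 80% rule concrete, here is a back-of-the-envelope sizing calculation. All numbers are made-up example values, not measurements:

```python
import math

# Hypothetical example figures for capacity planning.
peak_rps = 10_000          # expected peak requests per second
per_server_rps = 1_500     # what one server handles at 100% utilization
target_utilization = 0.8   # keep each server at or below 80%

# Effective capacity per server once we leave 20% headroom.
effective_rps = per_server_rps * target_utilization  # 1200 rps

servers_needed = math.ceil(peak_rps / effective_rps)
print(servers_needed)  # 9
```

Without the headroom you would size the fleet at 7 servers (10000 / 1500, rounded up) and run each of them hot, which is exactly the congestion the footnote warns about.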