PowerFlex, an industry-leading software-defined storage solution, offers unmatched flexibility and scalability for modern data centers. To harness its full potential, it is essential to follow best practices that ensure optimal performance and reliability. In this comprehensive guide, I have summarised these practices in a table format, making it easier to implement them.
As an IT professional, I know that not every best practice can be implemented in every environment. You can add extra columns to the table: mark each practice Y/N depending on whether you follow it in your environment, and record a justification for any practice that cannot be followed.
The aim of this article and its best practices table is to help you quickly review the Dell EMC PowerFlex storage best practices without reading through lengthy Dell guides.
Dell EMC PowerFlex best practices – Excel format
Category | Identifier | Best practice |
Security | 1.1.1 | Use a strong and complex password policy for all accounts associated with PowerFlex (a.k.a. ScaleIO). This includes the management console, storage nodes, and gateway nodes. Change the passwords for all default accounts. |
Security | 1.1.2 | Implement data encryption with careful planning. |
Security | 1.1.3 | Configure SSL/TLS encryption for all communication between PowerFlex components. This can be done during installation or post-installation. |
Security | 1.1.4 | Implement firewall rules to restrict access to the PowerFlex management network. This network should be accessible only by authorized administrators. |
Security | 1.1.5 | Implement Role-Based Access Control (RBAC) to restrict access to PowerFlex resources. This can be done using the PowerFlex REST API or the management console. |
Security | 1.1.6 | Enable only secure network protocols, such as HTTPS and SSH. |
Security | 1.1.7 | Implement two-factor authentication for added security. |
Design | 1.1.8 | Plan the PowerFlex architecture to meet performance, capacity, and availability requirements. This includes the number of storage and gateway nodes, the type of storage devices, and the network topology. |
Design | 1.1.9 | Use a dedicated storage network for PowerFlex traffic. This network should be separate from other production traffic to avoid performance degradation and security risks. |
Design | 1.1.10 | Plan for redundancy at all levels of the PowerFlex infrastructure. This includes redundant storage and gateway nodes, network paths, and power supplies. |
Design | 1.1.11 | Use high-quality server hardware and storage devices to ensure reliability and performance. Dell EMC provides a list of certified hardware for PowerFlex. |
Design | 1.1.12 | Use multiple network interfaces for redundancy and load balancing. This can be done using bonding, teaming, or other technologies. LACP is recommended when link aggregation groups are used. The use of static link aggregation is not supported. |
Design | 1.1.13 | If a node running an SDS has aggregated links to the switch and is running VMware ESX, the hash mode should be configured to use "Source and destination IP address" or "Source and destination IP address and TCP/UDP port". If the node is running Linux, the bonding hash mode should be set to "xmit_hash_policy=layer2+3" or "xmit_hash_policy=layer3+4": the layer2+3 option balances on source and destination MAC and IP addresses, while layer3+4 balances on source and destination IP addresses and TCP/UDP ports. On Linux, the "miimon=100" bonding option should also be used, which directs the kernel to verify the status of each physical link every 100 milliseconds (see the bonding sketch after the table). |
Design | 1.1.14 | Plan for future growth by ensuring that the PowerFlex infrastructure is scalable. This includes adding additional storage and gateway nodes as needed. |
Design | 1.1.15 | Ensure that the number of SDS threads is set to 12 on all SDS nodes. |
Network | 1.1.16 | Use high-quality network hardware for the PowerFlex network. This includes switches, routers, and network interface cards. Dell Technologies recommends the use of a non-blocking network design. OSPF is recommended over BGP. |
Network | 1.1.17 | Use a dedicated network for PowerFlex traffic. This network should be isolated from other production traffic to avoid interference and security risks. Separate the management and data traffic from the production application traffic using VLANs. |
Network | 1.1.18 | Enable jumbo frames to increase network throughput. This can be done on the network switches and the PowerFlex nodes (see the bonding and MTU sketch after the table). |
Network | 1.1.19 | Implement Quality of Service (QoS) to prioritize PowerFlex traffic. This can be done on the network switches to ensure that PowerFlex traffic is not impacted by other traffic on the network. For replication scenarios, there should be no more than 200 ms of latency between source and target systems. As a best practice, Dell recommends that the sustained write bandwidth of all volumes being replicated should not exceed 80% of the total available WAN bandwidth. |
Network | 1.1.20 | Implement network redundancy to ensure availability. This includes redundant network paths, switches, and routers. For best performance, latency for all SDS and SDC communication should never exceed 1 millisecond network-only round-trip time under normal operating conditions. |
Network | 1.1.21 | Use multiple network interfaces for redundancy and load balancing. This can be done using bonding, teaming, or other technologies. Network latency between peered PowerFlex cluster components, whether MDM→MDM or SDR→SDR, should not exceed 200 ms round-trip time. |
Network | 1.1.22 | Separate the VMware vSphere vMotion traffic from the application traffic according to the PowerFlex documentation. |
Network | 1.1.23 | MDM-to-MDM traffic requires a stable, reliable, low-latency network. At a minimum, two 10 GbE links should be used per MDM for production environments, although 25 GbE is more common. |
Network | 1.1.24 | For MDM-to-MDM traffic, IP-level redundancy or LAG is strongly recommended over MLAG: keeping at least one IP address on the MDM continuously reachable helps prevent failovers caused by the short timeouts between MDMs, which are designed to communicate over multiple IP addresses. |
Configuration | 1.1.25 | Use recommended RAID configurations for storage devices. This includes RAID 10 for optimal performance and redundancy. |
Configuration | 1.1.26 | Ensure that storage devices are correctly formatted and aligned. This includes using the correct sector size and partition alignment. |
Configuration | 1.1.27 | Use recommended server and network settings for optimal performance. This includes configuring the network settings for the PowerFlex nodes and the network switches. |
Configuration | 1.1.28 | Monitor the PowerFlex infrastructure for errors and performance issues. This can be done using the management console or third-party monitoring tools. |
Configuration | 1.1.29 | On Cisco Nexus switches, the "carrier-delay" timer should be set to 100 milliseconds on each SVI interface, and the "link debounce" timer should be set to 500 milliseconds on each physical interface (see the Nexus timer sketch after the table). |
Configuration | 1.1.30 | Ensure that the minimum number of nodes for production workloads is seven: four storage-only nodes and three compute-only nodes for a PowerFlex two-layer configuration in a VMware environment. A minimum of three compute-only nodes in an ESXi cluster is recommended to allow for HA/DRS. |
Configuration | 1.1.31 | Homogeneous node types are recommended for predictable performance. The compute nodes should be homogeneous and the storage nodes should be homogeneous, although the compute and storage nodes do not need to match each other. |
Configuration | 1.1.32 | The use of Equal-Cost Multi-Path routing (ECMP) is required. To provide stable intra-MDM communication, a sub-300-millisecond convergence time is required when OSPF is used. Additionally, for L3 handoff in ToR-Agg (Access-Agg) topologies, OSPF interfaces should be configured as point-to-point. |
Performance | 1.1.33 | Use high-performance storage devices to achieve optimal performance. Solid State Drives (SSDs) are recommended for high-performance workloads. |
Performance | 1.1.34 | Use multiple storage devices per PowerFlex node to increase performance and redundancy. |
Performance | 1.1.35 | Configure the cache size and I/O thread settings according to workload requirements. |
Performance | 1.1.36 | Monitor the performance of the PowerFlex infrastructure using performance monitoring tools. |
Performance | 1.1.37 | Optimize network settings such as MTU, buffer size, and TCP window size to achieve optimal performance (see the kernel network tuning sketch after the table). |
Performance | 1.1.38 | Monitor disk usage and perform routine maintenance such as garbage collection and compaction to ensure optimal performance. |
Performance | 1.1.39 | Disable the Intel C-state and P-state drivers in the Linux GRUB settings to comply with the performance best practices, then rebuild the GRUB boot image (see the GRUB sketch after the table): GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb intel_idle.max_cstate=1 intel_pstate=disable quiet", followed by grub2-mkconfig -o /boot/grub2/grub.cfg (BIOS) or grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg (UEFI). |
Performance | 1.1.40 | It is recommended to use the noop I/O scheduler on Linux SDC hosts when SSD drives are used. To make the change persistent across reboots, add elevator=noop to the GRUB_CMDLINE_LINUX line and rebuild the GRUB boot image, for example: GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet elevator=noop" |
Availability | 1.1.41 | Configure multiple replicas of data for high availability. This ensures that data can be accessed even if one or more storage nodes fail. |
Availability | 1.1.42 | Use fault-tolerant hardware such as redundant power supplies and network cards. |
Availability | 1.1.43 | Configure alerts and notifications to proactively identify and resolve issues. |
Availability | 1.1.44 | Perform routine maintenance such as software updates and hardware checks to ensure high availability. |
Availability | 1.1.45 | Test failover and disaster recovery scenarios to ensure that data can be recovered in the event of a disaster. |
Scalability | 1.1.46 | Plan for future growth by ensuring that the PowerFlex infrastructure is scalable. This includes adding additional storage and gateway nodes as needed. |
Scalability | 1.1.47 | Monitor the performance and capacity of the PowerFlex infrastructure to identify areas that may need additional resources. |
Scalability | 1.1.48 | Use automation tools such as Ansible or Terraform to automate the deployment and configuration of PowerFlex nodes. |
Disaster Recovery | 1.1.49 | Use synchronous or asynchronous replication to replicate data to a secondary site. |
Disaster Recovery | 1.1.50 | Test failover and recovery scenarios to ensure that data can be recovered in the event of a disaster. A one-hour outage might be reasonably expected, but Dell strongly encourages users to plan for three hours. Ensure sufficient journal space to account for the application writes during the outage. In general, the journal capacity should be calculated as peak write bandwidth multiplied by the link-down time (see the journal sizing example after the table). |
Disaster Recovery | 1.1.51 | Use backup and recovery solutions such as Dell EMC Data Protection Suite to ensure that data can be recovered in the event of data loss or corruption. |
Monitoring | 1.1.52 | Monitor the health and performance of the PowerFlex infrastructure using monitoring tools such as Dell EMC ViPR or Nagios. |
Monitoring | 1.1.53 | Configure alerts and notifications for critical events such as node failures or capacity thresholds. |
Monitoring | 1.1.54 | Monitor system logs and audit trails to identify potential security issues or unauthorized access attempts. |
Monitoring | 1.1.55 | Monitor the following areas using monitoring tools (in-house or OEM): • Input and output traffic • Errors, discards, and overruns • Physical port status • Latency • I/O throughput |
Storage Pools | 1.1.56 | Plan storage pool sizes based on workload requirements, and avoid creating oversized or undersized storage pools. Avoid creating too many small storage pools, as this leads to higher administrative overhead and complexity. |
Storage Pools | 1.1.57 | Consider the physical location of storage pools when creating them, and place storage pools on different physical storage devices and/or servers to distribute workload and ensure data redundancy. If possible, place storage pools on separate racks or even separate data centers to provide additional protection against natural disasters or other catastrophic events |
Storage Pools | 1.1.58 | Use replication to ensure data availability and protection against node or device failures. Consider the trade-off between replication level and storage overhead, and choose the appropriate level of replication based on workload requirements and budget constraints. Monitor replication status and consider implementing additional replication or data protection solutions if necessary. |
Storage Pools | 1.1.59 | Optimize storage pool performance by configuring cache sizes and I/O thread settings according to workload requirements. Use Solid State Drives (SSDs) for high-performance storage pools. Monitor storage pool performance using performance monitoring tools such as Grafana or Zabbix and adjust settings as necessary. |
Storage Pools | 1.1.60 | Plan for future growth by creating storage pools with additional capacity and expandability in mind. Monitor storage pool usage and capacity to proactively identify potential growth areas and adjust storage pool size and placement as needed. Maintain at least 20% free capacity in Pools. |
Provisioning | 1.1.61 | Plan the provisioning process carefully, taking into consideration factors such as workload requirements, available resources, and future growth. Identify potential bottlenecks or limitations in the PowerFlex infrastructure that may impact provisioning performance or capacity. |
Provisioning | 1.1.62 | Use automation tools such as Ansible or Terraform to automate the provisioning process, reducing the risk of human error and increasing efficiency. Create templates or scripts for common provisioning scenarios to streamline the process and reduce manual effort. |
Provisioning | 1.1.63 | Follow the recommended configuration guidelines for PowerFlex nodes and storage pools when provisioning new resources. Use the PowerFlex GUI or command-line interface to configure new nodes, storage pools, and volumes. Consider the optimal placement of new resources within the PowerFlex infrastructure to ensure performance and redundancy. |
Provisioning | 1.1.64 | Test new resources thoroughly before deploying them into production. Use synthetic workloads or test scripts to simulate actual workload conditions and identify potential issues or bottlenecks. Monitor resource usage and performance during testing to ensure that new resources are performing as expected. |
Provisioning | 1.1.65 | Monitor the health and performance of newly provisioned resources using monitoring tools such as Dell EMC ViPR or Nagios. Configure alerts and notifications for critical events such as resource failures or capacity thresholds. Monitor system logs and audit trails to identify potential security issues or unauthorized access attempts. |
Testing | 1.1.66 | Perform the SDS Network Test and the SDS Network Latency Meter Test. |
Testing | 1.1.67 | Perform hardware and configuration testing, which includes the power, redundancy, and logical configuration of the cluster. |
Testing | 1.1.68 | Use iperf and netperf to validate your network before configuring PowerFlex (see the iperf example after the table). |
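Example configuration sketches
The sketches below illustrate a few of the practices from the table. They are minimal examples based on my own assumptions about the environment (interface names, IP addresses, device names, and sizing figures are placeholders), not definitive implementations, so always validate them against the official Dell PowerFlex documentation for your release.
Bonding and jumbo frames (items 1.1.12, 1.1.13, and 1.1.18). This sketch assumes a RHEL-family SDS node managed by NetworkManager, two data NICs named ens1f0 and ens1f1, and an LACP-capable switch; adjust names, VLANs, and addressing to your design.
```bash
# Create an LACP (802.3ad) bond with the layer3+4 hash policy and miimon=100, per item 1.1.13
nmcli connection add type bond con-name bond0 ifname bond0 \
  bond.options "mode=802.3ad,xmit_hash_policy=layer3+4,miimon=100"

# Enslave the two data NICs (placeholder interface names)
nmcli connection add type ethernet con-name bond0-port1 ifname ens1f0 master bond0
nmcli connection add type ethernet con-name bond0-port2 ifname ens1f1 master bond0

# Enable jumbo frames on the bond and assign a placeholder data-network address, per item 1.1.18
nmcli connection modify bond0 802-3-ethernet.mtu 9000 \
  ipv4.method manual ipv4.addresses 192.168.10.11/24
nmcli connection up bond0

# Verify that 9000-byte frames pass end to end without fragmentation (8972 = 9000 minus IP/ICMP headers)
ping -M do -s 8972 -c 5 192.168.10.12
```
The jumbo-frame MTU must also be set on every switch port in the path; a mismatch typically shows up as the ping above failing while normal-sized packets still pass.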
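Cisco Nexus interface timers (item 1.1.29). An illustrative NX-OS snippet, assuming a Nexus top-of-rack switch with SVI Vlan100 and port Ethernet1/1; the interface names are placeholders and the exact syntax can vary between NX-OS releases, so verify against your switch documentation.
```
interface Vlan100
  carrier-delay msec 100
!
interface Ethernet1/1
  link debounce time 500
```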
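Kernel network tuning (item 1.1.37). A minimal sketch of the kind of buffer tuning item 1.1.37 refers to, assuming a RHEL-family node; the values below are illustrative starting points, not Dell-mandated numbers, and should be sized for your NIC speed and latency.
```bash
# Persist larger socket buffers and TCP window limits (illustrative values only)
cat <<'EOF' | sudo tee /etc/sysctl.d/99-powerflex-net.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
EOF

# Apply the new settings without a reboot
sudo sysctl --system
```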
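GRUB settings for C-states, P-states, and the noop scheduler (items 1.1.39 and 1.1.40). A sketch of how the two GRUB changes can be combined on a RHEL-family node; note that on newer kernels that use blk-mq, the scheduler name is "none" rather than "noop".
```bash
# 1. Edit /etc/default/grub so the kernel command line contains the options from items 1.1.39 and 1.1.40, e.g.:
#    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet \
#                        intel_idle.max_cstate=1 intel_pstate=disable elevator=noop"

# 2. Rebuild the GRUB configuration (pick the path that matches how the node boots)
sudo grub2-mkconfig -o /boot/grub2/grub.cfg              # BIOS boot
sudo grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg     # UEFI boot on RHEL

# 3. Reboot, then confirm the active I/O scheduler on an SDC data disk (device name is a placeholder):
#    cat /sys/block/sdb/queue/scheduler
sudo reboot
```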
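Replication journal sizing (item 1.1.50). A small worked example of the journal capacity formula, peak write bandwidth multiplied by link-down time; the 2000 MB/s peak write rate is an assumed figure, so substitute your own measurement.
```bash
# Journal capacity ≈ peak write bandwidth x expected link-down time (plan for 3 hours, per item 1.1.50)
PEAK_WRITE_MBPS=2000                       # assumed sustained peak write rate into replicated volumes, in MB/s
OUTAGE_SECONDS=$((3 * 3600))               # plan for a 3-hour WAN outage
JOURNAL_GB=$((PEAK_WRITE_MBPS * OUTAGE_SECONDS / 1024))
echo "Plan for at least ${JOURNAL_GB} GB of journal capacity"   # 2000 * 10800 / 1024 ≈ 21093 GB (~21 TB)
```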
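Network validation with iperf (item 1.1.68). A sketch of how node-to-node bandwidth and latency can be validated with iperf3 before deploying PowerFlex; the IP address 192.168.10.11 is a placeholder for one of the data-network interfaces.
```bash
# On one node, start an iperf3 server on the data network
iperf3 -s

# From each peer node, test towards the server with several parallel streams for 30 seconds
iperf3 -c 192.168.10.11 -P 8 -t 30

# Repeat in the reverse direction without swapping roles
iperf3 -c 192.168.10.11 -P 8 -t 30 -R

# Quick latency check between data interfaces (round trip should stay well under 1 ms, per item 1.1.20)
ping -c 100 -i 0.2 192.168.10.11 | tail -1
```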
Conclusion
By following these PowerFlex best practices, you can harness the full potential of this robust Dell EMC storage solution. Whether you’re looking to optimise performance, enhance reliability, or secure your data, I hope these guidelines will set you on the path to PowerFlex mastery. Implement them today and reap the benefits in your data centre operations.
These are the best practices that I have implemented in my environment. If you have different requirements, you can refer to the official Dell guide for more details.