cpus: higher performance non-cacheable load forwarding

The CPUACTLR_EL1 register on Cortex-A57 CPUs supports a bit to enable
non-cacheable streaming enhancement. Platforms can set this bit only
if their memory system meets the requirement that cache line fill
requests from the Cortex-A57 processor are atomic.

This patch adds support to enable higher performance non-cacheable load
forwarding for such platforms. Platforms must enable this support by
setting the 'A57_ENABLE_NONCACHEABLE_LOAD_FWD' flag from their
makefiles. This flag is disabled by default.

Change-Id: Ib27e55dd68d11a50962c0bbc5b89072208b4bac5
Signed-off-by: Varun Wadekar <vwadekar@nvidia.com>
diff --git a/docs/design/cpu-specific-build-macros.rst b/docs/design/cpu-specific-build-macros.rst
index f3096b4..258f73d 100644
--- a/docs/design/cpu-specific-build-macros.rst
+++ b/docs/design/cpu-specific-build-macros.rst
@@ -324,6 +324,13 @@
    as recommended in section "4.7 Non-Temporal Loads/Stores" of the
    `Cortex-A57 Software Optimization Guide`_.
 
+- ''A57_ENABLE_NON_CACHEABLE_LOAD_FWD'': This flag enables non-cacheable
+   streaming enhancement feature for Cortex-A57 CPUs. Platforms can set
+   this bit only if their memory system meets the requirement that cache
+   line fill requests from the Cortex-A57 processor are atomic. Each
+   Cortex-A57 based platform must make its own decision on whether to use
+   the optimization. This flag is disabled by default.
+
 -  ``NEOVERSE_N1_EXTERNAL_LLC``: This flag indicates that an external last
    level cache(LLC) is present in the system, and that the DataSource field
    on the master CHI interface indicates when data is returned from the LLC.