Date: | 2013-06-29 |
---|---|
Author: | Nikolaos Kavvadias <nkavvadias@ajaxcompilers.com> |
Email: | info@ajaxcompilers.com |
Revision: | 0.0.1 (2012-05-10), 0.0.2 (2013-05-29) |
Web site: | http://www.ajaxcompilers.com |
Copyright: | Nikolaos Kavvadias (C) 2009, 2010, 2011, 2012, 2013 |
Contents
HercuLeS is a High-level synthesis tool that automatically generates RTL VHDL for non-programmable hardware. HercuLeS translates programs in NAC (a bit-accurate typed-assembly language) to extended FSMDs (Finite-State Machines with Datapath) in VHDL. HercuLeS can also be used for direct synthesis of ANSI C code to VHDL with the help of a prototype translator from GIMPLE which is GCC's new intermediate representation to NAC.
Internally, HercuLeS comprises of two main components: a frontend (nac2cdfg) and a graph-based backend (cdfg2hdl):
HercuLeS also has an additional ANSI C backend, allowing comparison of NAC programs to reference ANSI C application code and the rapid prototyping of applications (VHDL simulation can be slow depending on design complexity, input data and the simulator used).
VHDL code generated by HercuLeS can be simulated with GHDL and the industry- standard Modelsim. It is possible to generate VHDL using either the Synopsys packages (the "old" de-facto standard) or the official IEEE library packages. HercuLeS supports fixed-point arithmetic via sfixed and ufixed vectors as defined by the VHDL-2008 fixed-point arithmetic packages. For this option, HercuLeS should be notified (via command-line option) to use the IEEE packages.
The basic steps in the HercuLeS flow are shown in Fig. hercules-overview. C code is passed to GCC for GIMPLE dump generation, optionally following an external source-level optimizer. Textual GIMPLE is then processed by gimple2nac; alternatively the user could directly supply a NAC translation unit (TU).
Various optimizations have been applied at the NAC level; peephole transformations, if-conversion, and function call insertion to enable IP integration. Heuristic basic block partitioning avoids the introduction of excessive critical paths due to operation chaining.
The core of HercuLeS comprises of a frontend (nac2cdfg) and a graph-based backend (cdfg2hdl). nac2cdfg is a translator from NAC to flat CDFGs represented in Graphviz. cdfg2hdl is the actual synthesis kernel for automatic FSMD hardware and self-checking testbench generation from Graphviz CDFGs to VHDL.
nac2cdfg is used for parsing, analysis and CDFG extraction from NAC programs. SSA form is supported based on minimal generation algorithms. Data flow analysis uses on-demand graph reachability checking.
cdfg2hdl maps CDFGs to an extended FSMD MoC (Model of Computation). For scheduling operations to specific states, either sequential or control- aware ASAP scheduling can be used. ASAP can be combined with fast operation chaining for better state workload balancing.
The generated FSMDs are generalized FSMs introducing embedded actions, with: a) support of array input, output and streaming I/O ports, b) communication with embedded block and distributed LUT memories, c) latency-insensitive local interface between caller and callee FSMDs, and d) interfacing to external IP blocks.
An additional ANSI C backend allows for rapid algorithm prototyping and NAC verification. VHDL code can be simulated with GHDL and Modelsim.
The current features of HercuLeS include:
The purpose of this text is to provide a quick-start guide to using the HercuLeS high-level synthesis tool through a web interface. This version of HercuLeS does not provide access to certain features such as arithmetic and loop-oriented optimizers.
Minimal requirements:
Visit the HercuLeS web interface.
Unzip http://www.nkavvadias.com/hercules/hercules-contrib-vhdl.zip to a local directory, e.g. C:\hercules\contrib
Create an empty directory, e.g. C:\hercules\tests
Either select the supplied (already pasted) example of the "fact" (factorial) function or copy and paste your own. Download http://www.nkavvadias.com/hercules/small-examples.zip for a few ANSI C and NAC (generic assembly) code samples.
NOTES:
#ifdef TEST #endif
- In addition, it is expected that the main() function generates input and reference output samples in hexadecimal format and in separate columns. A proper main() would generate such samples in a file named fact_test_data.txt (for the fact example).
- Standard C library includes should be also enclosed by the aforementioned directive.
- Read Section 5 of http://www.nkavvadias.com/hercules/hercules-web-guide.html for a short guide on ANSI C code style and limitations (WIP).
The following figure gives an internal view to the process flow of HercuLeS.
The user of HercuLeS must provide two input files:
Then, nac2cdfg generates several files:
Following this, there exist two possible flows; one for the generation and simulation of synthesizable RTL VHDL for the NAC program, and one for a C simulation.
The C simulation flow proceeds by invoking the ansic.mk makefile by running:
make -f ansic.mk
from the command line. This produces a main.exe executable specification (e.g. on Windows/Cygwin). Then, the executable is run:
./main
and output is produced at the command prompt.
The VHDL flow involves processing all CDFG (.dot) files by cdfg2hdl, the actual backend tool of HercuLeS. cdfg2hdl generates several files:
Finally, the design.sh script is run from the command line:
./design.sh
This produces a text file (design_alg_test_results.txt) providing diagnostic output from a simulation run. Output to the command prompt for any internal program variable, procedure argument, etc can be produced by using the "print" NAC operation. A "print" is mapped to a VHDL "assert" construct or a C standard library "printf".
Also, a VCD (design_fsmd.vcd) or GHW (design_fsmd.ghw) waveform file can be generated for viewing with GTKwave. Windows binaries for GTKwave can be found at http://www.dspia.com/gtkwave.html.
The usage of the nac2cdfg is as follows:
nac2cdfg [options] input.nac
where options is one or more of the following:
The usage of the cdfg2hdl is as follows:
cdfg2hdl [options] input.dot
where options is one or more of the following:
A novel, fast CDFG construction algorithm has been devised for both SSA and non-SSA NAC forms producing flat CDFGs as Graphviz files. A CDFG symbol table item is a node (operation, procedure call, globalvar, or constant) or edge (localvar) with user-defined attributes: the unique name, label and data type specification; node and edge type enumeration; respective order of incoming or outgoing edges; input/output argument order of a node and basic block index. Further attributes can be defined, e.g. for scheduling bookkeeping.
This approach is unique since it focuses on building the CDFG symbol table (st) from which the associated graph (cdfg) is constructed as one possible of many facets. It naturally supports loop-carried dependencies and array accesses.
// CDFG construction algorithm accepting BASIL input. NACtoCDFG() input List BASILs, List variables, List labels, Graph cfg; output SymbolTable st, Graph cdfg; begin Insert constant, input/output arguments and globalvar operand nodes to st; Insert operation nodes; Insert incoming {global/constant/input, operation} and outgoing {operation, global/output} edges; Add control-dependence edges among operation nodes; Add data-dependence edges among operation nodes, extract loop-carried dependencies via cfg-reachability; Generate cdfg from st; end
gimple2nac translates GIMPLE dumps to NAC TUs, which presents challenges. Currently, GIMPLE loses source semantics such as global scope variables. A workaround for some cases is the definition of static variables in the original source. Other issues involve: format inconsistencies following different optimization passes, handling labels and array initialization sequences, destroyed interfaces and lack of bit-accuracy. Some of these problems are targeted by a GCC rewrite [GIMPLEbe] that will allow emitting GIMPLE as a target language.
GIMPLE is the machine-independent intermediate representation used in modern GCC releases (post version 4.0). While the GIMPLE API for code generation and manipulation has matured over time, the corresponding textual representation is yet to be in a stable form. This issue hampers many serious efforts for code generation from and to GIMPLE.
Good candidates for the textual representation seem to have been in play for some time. The wiki site http://gcc.gnu.org/wiki/GIMPLE sketches the textual IR that is expected to be implemented by the GIMPLE frontend and backend (both under development). Another case of a textual GIMPLE are the formats generated as a GIMPLE dumps. We distinguish here two formats: the format in *.t004.gimple files (tagged GIMPLE) and the one e.g. in *.t140.optimized files (GIMPLE-C). Both formats express the low-level GIMPLE representation, which is closer to classic three-address code than high-level GIMPLE; the latter is closer to the GENERIC AST representation. An extended form of GIMPLE dumps are expected to be established as the GIMPLE language semantics.
For reference tagged GIMPLE is generated by the command-line option:
-fdump-tree-all-raw
while GIMPLE-C is generated by:
-fdump-tree-all and -fdump-tree-gimple
the latter only emitting the *.t004.gimple file.
The rest of this document discusses open issues with the textual GIMPLE IRs, focusing on the tagged GIMPLE format. It will not cover extended semantics issues that are covered much better at http://gcc.gnu.org/wiki/GIMPLEFrontEnd
This section is expected to be of interest to compiler/translator implementors using GIMPLE as either a source or target language.
Here follows a listing of some issues that can be identified with using tagged GIMPLE. Most of them apply to the GIMPLE-C format as well.
One such example is the omission of emitting global variables. A workaround for some cases is the definition of static variables in the original source. This approach provides only specific file/translation unit scope to globals but it is not certain whether the extern specifier is handled properly for referencing these globals from external scope.
Automatically generated labels (by the gimplifier) and labels defined in the source program are represented differently. The first category are enclosed in single wedges < > while the later are explicitly defined and are used without wedges.
For example, this is the definition of an automatically generated label:
gimple_label <<D.1983>>
while the following is the (redundant) definition:
void V1 = <<< error >>>; ... ``gimple_label <V1>
and the use of a source label:
gimple_cond <eq_expr, D.1985, 1, V5, <D.1986>>
Function interfaces are not maintained appropriately since the original argument types in a function definition may be replaced. This is the case with array arguments (with static sizes) in the definition of non-root procedures.
Here follows an example. This is the a partial view of the source program:
void evalcoins(int n, int amount, int C[], int *ncoins, int D[]) { ... } void coins(int n_eurocents, int *n_coins_used) { int C_euro[15], D_euro[15]; ... evalcoins(15, n_eurocents, C_euro, &n_items, D_euro); ... } However, this is the tagged GIMPLE form of the same program:
evalcoins (int n, int amount, int * C, int * ncoins, int * D) gimple_bind < ... > coins (int n_eurocents, int * n_coins_used) gimple_bind < int C_euro[15]; int D_euro[15]; ... gimple_call <evalcoins, NULL, 15, n_eurocents, &C_euro[0], &n_items, &D_euro[0]> ... >
Low-level GIMPLE (tagged and GIMPLE-C) use two basic specific operations for dealing with pointer expressions and indirect references, the pointer_plus_expr and indirect_ref.
A pointer_plus_expr in tagged GIMPLE appears as follows:
gimple_assign<pointer_plus_expr, D.1986, D, D.1985>
where D is an array, D.1985 a temporary int variable and D.1986 a temporary variable defined as pointer to int.
The same in GIMPLE-C is the following:
D.1986 = D + D.1985
This operation adds the offset determined by D.1985 to the base address of array D, expressed as D. Pointer D.1986 then can be used for intexing the array.
A typical idiom in generated GIMPLE suggests that pointer_plus_expr is followed by an indirect_ref. The indirect reference is used to access the array and loading the contents of a memory position to a variable.
An indirect_ref in tagged GIMPLE appears as follows:
gimple_assign<indirect_ref, D.1987, *D.1986, NULL>
and the same in GIMPLE-C is:
D.1987 = *D.1986;
In order to avoid a thorough pointer analysis for establishing that D.1986 points to the contents of array D, typical data-dependence analysis can be used to trace that D is the referenced entity by D.1986.
Function calls are represented by gimple_call in tagged GIMPLE. Due to the issue 3 (Destroyed interfaces), in some cases calls-by-reference appear when not really needed. This refers to the simulated call-by-reference available in the C programming language, and not the actual kind that can be found, e.g. in Perl.
For example the following call is by-reference:
gimple_call <..., &C_euro[0],...>
making use of the address of the first element of C_euro (the base address).
With interfaces kept unchanged, the following would suffice:
gimple_call <..., C_euro, ...>
An array can be initialized either by a literal initialization list or by emitting a sequence of operations for initializing its contents. From a black-box point of view, it seems that the gimplifier arbitrarily chooses which approach to follow.
For example, in our example, the C_euro is initialized via explicit operations:
gimple_assign <integer_cst, C_euro[0], 1, NULL> gimple_assign <integer_cst, C_euro[1], 2, NULL> gimple_assign <integer_cst, C_euro[2], 5, NULL> gimple_assign <integer_cst, C_euro[3], 10, NULL> gimple_assign <integer_cst, C_euro[4], 20, NULL> gimple_assign <integer_cst, C_euro[5], 50, NULL> gimple_assign <integer_cst, C_euro[6], 100, NULL> gimple_assign <integer_cst, C_euro[7], 200, NULL> gimple_assign <integer_cst, C_euro[8], 500, NULL> gimple_assign <integer_cst, C_euro[9], 1000, NULL> gimple_assign <integer_cst, C_euro[10], 2000, NULL> gimple_assign <integer_cst, C_euro[11], 5000, NULL> gimple_assign <integer_cst, C_euro[12], 10000, NULL> gimple_assign <integer_cst, C_euro[13], 20000, NULL> gimple_assign <integer_cst, C_euro[14], 50000, NULL> It is not clear why an initialization list is not used:
int C_euro[15] = {1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000};
Automatically generated labels (by the gimplifier) and labels defined in the source program are represented differently. The first category are enclosed in single wedges < > while the later are explicitly defined and are used without wedges.
For example, this is the definition of an automatically generated label:
gimple_label <<D.1983>>
while the following is the (redundant) definition:
void V1 = <<< error >>>; ... ``gimple_label <V1>
and the use of a source label:
gimple_cond <eq_expr, D.1985, 1, V5, <D.1986>>
As discussed in the introduction, the tagged GIMPLE format uses alternate syntax for the unoptimized (*.t004.gimple) and certain optimized (e.g. *.t140.gimple) intermediate code dumps. It would be clearer if a single convention throughout all GIMPLE dumps.
The availability of bit-accurate data types is an interesting asset of modern compiler infrastructures such as LLVM: http://www.llvm.org. LLVM uses the LLVM bitcode IR which adheres to such semantics. On the other side, GCC GIMPLE might be too closely coupled with C-like semantics. Especially, implementors of non-conventional backend architectures (e.g. developers of hardware compilers) would be interested in a form of GIMPLE with bit-accurate types.
For example, the following would denote a 14-bit unsigned integer and a 8.16 signed fixed-point representation, respectively.
- u14
- q8.16s
This section is a work-in-progress. Several aspects of programming language translation to low-level GIMPLE are not covered:
NAC (N-Address Code) is the name of a simplistic imperative programming language with light semantics devised by Nikolaos Kavvadias. Its main use is as an executable/interpretable intermediate representation for compilation frameworks (compilers, high-level synthesis tools, etc).
NAC statements are either labels, n-address instructions or procedure calls.
A label is formatted as follows:
An n-address instruction is actually the specification of a mapping from a set of n ordered inputs to a set of m ordered outputs. An n-address instruction (or else termed as an {m, n}-NAC) is formatted as follows:
where
Similarly, a procedure call, which is a non-atomic operation is formatted as follows, in order to distinguished from an atomic operation:
where
For a procedure without input and output arguments, the following notation is used to distinguish it from an atomic operation with no arguments:
NAC is a typed language. Data type specifications are essentially strings that can be easily decoded by a regular expression scanner. All declared objects (global variables, constants, local variables, input and output procedure arguments) have a type specification. Data types in NAC are classified in the following categories:
As of 2010-11-29, there is initial support for the SIGNED_FIXED_POINT and UNSIGNED_FIXED_POINT data types. As of 2012-02-23 there is also initial support (for the C backend) for the FLOATING_POINT data types, especially the IEEE-754 compliant "single" and "double". Support for UNSIGNED_INTEGER and SIGNED_INTEGER data types is considered mature.
In NAC parlance, the following keywords are used:
Please note that the use of constant (declaration of a globally-visible constant value) has been discontinued and will not be supported in the future.
The NAC programming language is extensible, meaning that the grammar accepts user-specific instruction mnemonics.
A common set of NAC instructions is listed below, along with the corresponding format and description.
No-operation: nop
nop;
Performs no action at all.
Move operand: mov
dst1 <= mov src1;
Copy the contents of operand src1 to dst1.
Load constant: ldc
dst1 <= ldc cnst1;
Copy the value of cnst1 to operand dst1.
Unconditional jump: jmpun
S_dst1 <= jmpun;
Jump to label S_dst1.
Conditional jump: jmpeq, jmpne``, jmplt, jmple``, jmpgt, jmpge``
S_TRGT, S_TRGF <= jmpzz src1, src2;
where:
- zz can be one of the following:
- eq: jump if equal
- ne: jump if not equal
- lt: jump if less than
- le: jump if less than or equal
- gt: jump if greater than
- ge: jump if greater than or equal
- src1, src2 are the instruction source operands
- S_TRGT, S_TRGF, are the target addresses for a true and false condition, respectively
Binary logical instructions: and, ior, xor, nand, nor, xnor
dst1 <= <mnemonic> src1, src2;
where:
- mnemonic can be one of the following:
- and: Logical AND
- ior: Logical inclusive-OR
- xor: Logical exclusive-OR
- nand: Logical NAND
- nor: Logical NOR
- xnor: Logical XNOR
- src1, src2 are the source operands
- dst1 is the destination operand
Unary logical instruction: not
dst1 <= not src1;
Copy the 1's-complement of operand src1 to dst1.
Binary arithmetic instructions: add, sub
dst1 <= mnemonic src1, src2;
where:
- mnemonic can be one of the following:
- add: 2's-complement addition
- sub: 2's-complement subtraction
- src1, src2 are the source operands
- dst1 is the destination operand
Unary arithmetic instructions: neg
dst1 <= neg src1;
Copies the negated version of src1 to dst1.
Quaternary multiplexing instruction: mux
dst1 <= muxzz src1, src2, src3, src4;
where:
- zz can be one of the following:
- eq: jump if equal
- ne: jump if not equal
- lt: jump if less than
- le: jump if less than or equal
- gt: jump if greater than
- ge: jump if greater than or equal
- src1, src2, are the source operands compared: if (src1 zz src2)
- src3 is the copy operand when the comparison evaluates to TRUE
- src4 is the copy operand when the comparison evaluates to FALSE
- dst1 is the destination operand
- NOTE: A muxzz is equivalent to the following C code:
if (src1 zz src2) { // zz: "==", "!=", "<", "<=", ">", or ">=" dst1 = src3; } else { dst1 = src4; }
Set on comparison instruction: set
dst1 <= setzz src1, src2;
where:
- zz can be one of the following:
- eq: jump if equal
- ne: jump if not equal
- lt: jump if less than
- le: jump if less than or equal
- gt: jump if greater than
- ge: jump if greater than or equal
- src1, src2, are the source operands compared: src1 zz src2
- src3 is the copy operand when the comparison evaluates to TRUE
- src4 is the copy operand when the comparison evaluates to FALSE
- dst1 is the destination operand (gets a value either 0 or 1).
- NOTE: A setzz is equivalent to the following C code:
``dst1 = (src1 zz src2); // zz: "==", "!=", "<", "<=", ">", or ">="
Complex unary arithmetic instructions: abs
dst1 <= abs src1;
Copies the absolute value of src1 to dst1.
Complex binary arithmetic instructions: max, min
dst1 <= mnemonic src1, src2;
where:
- mnemonic can be one of the following:
- max: Assign the maximum of src1 and src2 to dst1
- min: Assign the minimum of src1 and src2 to dst1
- src1, src2 are the source operands
- dst1 is the destination operand
Shift instructions: shl, shr
dst1 <= mnemonic src1, src2;
where:
- mnemonic can be one of the following:
- shl: Logical left shift of src1 by the amount stored in src2, with the result copied to dst1
- shr: Either logical or arithmetic (depending on the operand data types) shift of src1 by the amount stored in src2, with the result copied to dst1
- src1, src2 are the source operands
- dst1 is the destination operand
Rotate instructions: rotl, rotr
dst1 <= mnemonic src1, src2;
where:
- mnemonic can be one of the following:
- rotl: Left rotation of the value of src1 by the amount stored in src2, with the result copied to dst1
- rotr: Right rotation of the value of src1 by the amount stored in src2, with the result copied to dst1
- src1, src2 are the source operands
- dst1 is the destination operand
Multiplication instructions: mul
dst1 <= mul src1, src2;
Multiplies the contents of src1 and src2 and copies the (possibly truncated) result to dst1.
Combined division-modulus instructions: divrem
dst1, dst2 <= divrem src1, src2;
Divides the contents of src1 and src2 and copies the quotient to dst1 and the remainder to dst2.
Division instructions: div, rem
dst1 <= mnemonic src1, src2;
where:
- mnemonic can be one of the following:
- div: Divides the contents of src1 and src2 and copies the quotient to dst1
- rem: Divides the contents of src1 and src2 and copies the remainder to dst1
- src1, src2 are the source operands
- dst1 is the destination operand
Data type/bitwidth conversion instructions: zxt, sxt, trunc
dst1 <= mnemonic src1;
where:
- mnemonic can be one of the following:
- zxt: Zero-extends src1 to the (larger) bitwidth of dst1
- sxt: Sign-extends src1 to the (larger) bitwidth of dst1
- trunc: Truncates src1 to the (smaller) bitwidth of dst1
- src1 is the source operand
- dst1 is the destination operand
Bit manipulation instructions: bitins, bitext
dst1 <= mnemonic src1, src2, src3;
where:
- mnemonic can be one of the following:
- bitins: Insert a bitvector denoted by the downto range [src2..src3] of src1 to dst1
- bitext: Extract a bitvector denoted by the downto range [src2..src3] from src1 and assign it to dst1
- src1 is the source operand
- src2 are two source operands (constant or variables) that denote the downto range. The runtime numerical value of src2 must be larger or equal to src3, and within the range of dst1
- dst1 is the destination operand
These instructions define bitfield insertion and extraction primitives. They can also be defined for fixed-point operands given additional constraints.
Load variable from array: load
dst1 <= load src1, src2;
Loads the contents of array src1 from the absolute address src2 to the variable dst1.
Store variable to array: store
dst1 <= store src1, src2;
Stores the value of variable src1 to address src2 of array dst1.
The use of fixed-point arithmetic provides an inexpensive means for improved numerical dynamic range, when artifacts due to quantization and overflow effects can be tolerated. Rounding operators are used for controlling the numerical precision involved in a series of computations; they are defined for inexact arithmetic representations such as fixed- and floating-point. Proposed and in-use specifications for fixed-point arithmetic of related practice include C99 and the Mentor ACDT (Algorithmic C Data Types).
Fixed-point arithmetic is a variant of the typical integral representation (2's-complement signed or unsigned) where a binary point is defined, purely as a notational artifact to signify integer powers of 2 with a negative exponent. Assuming an integer part of width IW > 0 and a fractional part with -FW < 0, the VHDL-2008 sfixed data type has a range of 2^{IW-1}-2^{|FW|} to -2^{IW-1} with a representable quantum of 2^|FW|. The corresponding ufixed type has the following range: 2^{IW}-2^{|FW|} to 0. Both are defined properly given a [IW-1:-FW] vector range.
This is a proposed list of extension operators for use with fixed-point variables support in the NAC programming language.
Conversion from integer to fixed-point format: i2ufx, i2sfx
dst1 <= i2zfx src1;
where:
- z can be one of the following:
- u: conversion to the ufixed (UNSIGNED_FIXED_POINT) format
- s: conversion to the sfixed (SIGNED_FIXED_POINT) format
- src1 is the source operand
- dst1 is the destination operand
Converts an integer to a fixed-point number without loss of precision.
Conversion from fixed-point to integer format: ufx2i, sfx2i
dst1 <= zfx2i src1;
where:
- z can be one of the following:
- u: conversion to the ufixed (UNSIGNED_FIXED_POINT) format
- s: conversion to the sfixed (SIGNED_FIXED_POINT) format
- src1 is the source operand
- dst1 is the destination operand
Converts a fixed-point number to an integer. In case of a non-zero fractional part of the fixed-number, truncation occurs. The type of the integer result (UNSIGNED_INTEGER or SIGNED_INTEGER) must be compatible to the type of the fixed-point input argument to assure a proper conversion.
Resize instruction: resize
dst1 <= resize src1, src2, src3;
where:
- src1 is the source fixed-point operand
- src2, src3 are numerical values (integers) that denote the new size (high-to-low range) of the resulting fixed-point operand
- dst1 is the destination fixed-point operand
Fixed-point rounding instructions: ceil, fix, floor, round, nearest, convergent
dst1 <= mnemonic src1;
where:
- src1 is the source operand
- dst1 is the destination operand
These operations are used to performing rounding of fixed-point operands with different criteria. They emulate the behavior of corresponding MATLAB intrinsic functions. Rounding behavior is summarized as follows:
- ceil: round towards plus infinity.
- fix: round towards zero.
- floor: round towards minus infinity.
- round: round to nearest; ties to greatest absolute value.
- nearest: round to nearest; ties to plus infinity.
- convergent: round to nearest; ties to closest even.
For simplifying programming in the NAC language, a set of macroinstructions are available:
A) Automatic replacement of incomplete conditional jumps: The pattern
S_TRUE <= jmpxx opnd1, opnd2;
is replaced by:
S_TRUE, S_FALSE <= jmpxx opnd1, opnd2;S_FALSE:
Label S_FALSE is generated only if it doesn't already exist.
B) Addition of "forgotten" unconditional jumps. The pattern:
no-jump-instruction;LABEL:
is replaced by:
no-jump-instruction;LABEL <= jmpun;LABEL:
A NAC program can be specified in a single source file that can contain global variable definitions and their initializations, and a list of procedures. Each procedure is comprised of the following: - the procedure name - a list of ordered input arguments - a list of ordered output arguments - a list of localvar declarations - a list of statements (the main NAC subprogram) - basic block labels
Label items point to basic block (BB) entry points and are defined as name, bb, addr 3-tuples, where name is the corresponding identifier, bb the basic block enumeration, and addr the absolute address of the statement succeeding the label.
Statements are organized in the form of a C struct or equivalently a record (in other programming languages) as follows:
typedef struct { char *mnemonic; /* Designates the statement type. */ NodeType ntype; /* OPERATION or PROCEDURE_CALL. */ List opnds_in; /* Collects all input operands. */ List opnds_out; /* Collects all output operands. */ int bb; /* Basic block number. */ int addr; /* Absolute statement address. */ } _Statement; typedef _Statement *Statement; C-style record for encoding a BASIL statement.
The Statement ADT therefore can be used to model an (n,m)-operation. The input and output operand lists collect operand items, as defined in the OperandItem data structure definition:
typedef struct { char *name; /* Identifier name. */ char *dataspec; /* Data type string spec. */ OperandType otype; /* Operand type representation. */ int ix; /* Absolute operand item index. */ } _OperandItem; typedef _OperandItem *OperandItem; C-style record for encoding an OperandItem.
The OperandItem data structure is used for representing input arguments (INVAR), output arguments (OUTVAR), local (LOCALVAR) and global (GLOBALVAR) variables and constants (CONSTANT). If using a graph-based intermediate representation, arguments and constants could use node and incoming or outgoing edge representations, while it is meaningful to represent variables as edges as long as their storage sites are not considered.
The typical NAC program is structured as follows:
<Global variable declarations> procedure <name-1> ( <comma-separated input arguments>, <comma-separated output arguments> ) { <Local variable declarations> <NAC labels, instructions and procedure calls> } ... procedure <name-n> ( <comma-separated input arguments>, <comma-separated output arguments> ) { <Local variable declarations> <NAC labels, instructions and procedure calls> }
Since version 0.0.3 of the standalone manual for the NAC programming language, the need for declaring constant items has been eliminated, and for this reason constant items are recognized by scanning through the NAC program prior any actual further manipulations (e.g. code generation). A small set of simple rules are used for data type inference of constant values:
1. When a constant appears in an "ldc" or "store" operation, it obtains the type of the result operand.
2. When a constant appears in any other operation, then it obtains the type of the first input operand. This assumes that the constant appears only as the second, third or fourth input operand for this operation.
Here follows the BNF-style grammar specification for the NAC programming language.
This grammar uses the notation of the YACC/Bison parser generators.
%token T_LPAREN T_RPAREN T_LBRACE T_RBRACE T_LBRACKET T_RBRACKET %token T_COMMA T_COLON T_SEMI T_ASSIGN T_EQUAL %token T_PROCEDURE T_LOCALVAR T_GLOBALVAR T_CONSTANT T_IN T_OUT %token T_ID %start nac_top %% nac_top : procedure_list | globalvar_def procedure_list ; globalvar_def : globalvar_prefix id_list T_SEMI | globalvar_def globalvar_prefix id_list T_SEMI ; globalvar_prefix : T_GLOBALVAR type_spec ; procedure_def : procedure_prefix T_LPAREN arg_list T_RPAREN T_LBRACE stmt_list T_RBRACE | procedure_prefix T_LPAREN arg_list T_RPAREN T_LBRACE localvar_list stmt_list T_RBRACE ; procedure_list : procedure_def | procedure_list procedure_def ; procedure_prefix : T_PROCEDURE id ; localvar_list : localvar_prefix id_list T_SEMI | localvar_list localvar_prefix id_list T_SEMI ; localvar_prefix : T_LOCALVAR type_spec ; stmt_list : /* empty */ | stmt_list stmt ; stmt : nac | pcall | label ; nac : opnd_out_list assign_op id opnd_in_list T_SEMI | opnd_out_list assign_op id T_SEMI | id opnd_in_list T_SEMI | id T_SEMI ; pcall : T_LPAREN opnd_out_list T_RPAREN assign_op id T_LPAREN opnd_in_list T_RPAREN T_SEMI | T_LPAREN opnd_out_list T_RPAREN assign_op id T_SEMI | id T_LPAREN opnd_in_list T_RPAREN T_SEMI | T_LPAREN T_RPAREN assign_op id T_LPAREN T_RPAREN T_SEMI ; assign_op : T_ASSIGN ; label : id T_COLON ; opnd_out_list : id_list ; opnd_in_list : id_list ; arg_list : /* empty */ | arg_in | arg_out | arg_list T_COMMA arg_in | arg_list T_COMMA arg_out ; arg_in : T_IN type_spec id ; arg_out : T_OUT type_spec id ; id_list : id | id_list T_COMMA id ; id : T_ID ; type_spec : T_ID ;
This grammar follows the EBNF notation as used by N. Wirth.
nac_top = {gvar_def} {proc_def}. gvar_def = "globalvar" anum decl_item_list ";". proc_def = "procedure" [anum] "(" [arg_list] ")" "{" [{lvar_decl}] [{stmt}] "}". stmt = nac | pcall | id ":". nac = [id_list "<="] anum [id_list] ";". pcall = ["(" id_list ")" "<="] anum ["(" id_list ")"] ";". id_list = id {"," id}. decl_item_list = decl_item {"," decl_item}. decl_item = (anum | uninitarr | initarr). arg_list = arg_decl {"," arg_decl}. arg_decl = ("in" | "out") anum (anum | uninitarr). lvar_decl = "localvar" anum decl_item_list ";". initarr = anum "[" id "]" "=" "{" numer {"," numer} "}". uninitarr = anum "[" [id] "]". anum = (letter | "_") {letter | digit}. id = anum | ["-"] numeric. numeric = (integer | fxpnum). fxpnum = [integer] "." integer. integer = digit {digit}.
eda.nac is the N-address code (NAC) implementation for a 2D Euclidean distance approximation algorithm given by the equation: eda = MAX(0.875*x+0.5*y, x) where x = MAX(|a|,|b|), y = MIN(|a|,|b|).
procedure eda (in s16 in1, in s16 in2, out u16 out1) { localvar u16 x, y, t1, t2, t3, t4, t5, t6, t7; localvar s16 a, b; S_1: a <= mov in1; b <= mov in2; t1 <= abs a; t2 <= abs b; x <= max t1, t2; y <= min t1, t2; t3 <= shr x, 3; t4 <= shr y, 1; t5 <= sub x, t3; t6 <= add t4, t5; t7 <= max t6, x; out1 <= mov t7; }
fibo.nac is the N-address code (NAC) implementation for the iterative version of Fibonacci series computation.
procedure fibo(in u31 n, out u31 outp) { localvar u31 res, x; localvar u31 f0, f1, f, k; LL0: x <= mov n; f0 <= ldc 0; f1 <= ldc 1; res <= mov f0; S_EXIT, LL1 <= jmple x, 0; LL1: res <= mov f1; S_EXIT, LL2 <= jmpeq x, 1; LL2: k <= ldc 2; LL3 <= jmpun; LL3: f <= add f1, f0; f0 <= mov f1; f1 <= mov f; res <= mov f; k <= add k, 1; LL3, S_EXIT <= jmple k, x; S_EXIT: outp <= mov res; }
The following computes the sum of the elements of array arr[10], that is the sum of the first ten primes.
globalvar s32 arr[10]={2,3,5,7,11,13,17,19,23,27}; procedure main (in s32 in1, out s32 out1) { localvar s32 D_1963; localvar s32 i; localvar s32 sum; L0001: sum <= ldc 0; i <= ldc 0; D_1221 <= jmpun; D_1220: i0 <= mov i; D_1963 <= load arr, i; sum <= add sum, D_1963; i <= add i, 1; D_1221 <= jmpun; D_1221: D_1220, D_1222 <= jmplt i, in1; D_1222: out1 <= mov sum; }
Here follows a list of suggestions for easier programming and code generation in NAC.
And some notes clarifying some issues for potential hardware implementations.
HercuLeS expects that the user resorts to a synthesis-friendly coding style with the following basic rules:
This version comes with certain intentional limitations. Here is a quick list:
The FSMD (Finite-State Machine with Datapath) is a model of computation model which is universal, well-defined and suitable for either data- or control- dominated applications. The generated hardware architectures from HercuLeS follow an extended FSMD model as presented in this section.
This section deals with aspects of specification and design of FSMDs, especially their interface, architecture and organization, as well as communication and integration issues. The section is wrapped-up with realistic examples of CDFG mappings to FSMDs, alongside their performance investigation with the help of HDL simulations.
An FSMD specification is an upgraded version of the well-known FSM representation providing the same information as the equivalent CDFG. The main difference is the introduction of embedded actions within the next state generation logic. An FSMD specification is timing-aware since it must be decided that each state is executed within a certain amount of machine cycles. Also the precise RTL semantics of operations taking place within these cycles must be determined. In this way, an FSMD can provide an accurate model of an RTL design's performance as well as serve as a synthesizable manifestation of the designer's intent. Depending on the RT-level specification (usually VHDL or Verilog) it can convey sufficient details for hardware synthesis to a specific target platform, e.g. Xilinx FPGA devices.
HercuLeS FSMDs follow the established scheme of a Mealy FSM with computational actions embedded within state logic. In this work, the extended FSMD MoC describing the hardware architectures supports the following features, the most relevant of which will be sufficiently described and supported by short examples:
- Support of scalar and array input and output ports.
- Support of streaming inputs and outputs and allowing mixed types of input and output ports in the same design block.
- Communication with embedded block and distributed LUT memories.
- Design of a latency-insensitive local interface of the FSMD units to master FSMDs, assuming the FSMD is a locally-interfaced slave.
- Design of memory interconnects for the FSMD units.
The FSMDs of our approach use fully-synchronous conventions and register all their outputs. The control interface is rather simple, yet can service all possible designs:
clk: signal from external clocking source
- reset (rst or arst): synchronous or asynchronous reset,
depending on target specification
ready: the block is ready to accept new input
valid: asserted when a certain data output port is streamed-out from the block (generally it is a vector)
done: end of computation for the block
ready signifies only the ability to accept new input (non-streamed) and does not address the status of an output (streaming or not).
Multi-dimensional data ports are feasible based on their equivalent single- dimensional flattened array type definition. Then, port selection is a matter of bitfield extraction. For instance, data input din is defined as din: in std_logic_vector(M*N-1 downto 0);, where M,``N`` are generics. The flattened vector defines M input ports of width N. A selection of the form din((i+1)*N-1 downto i*N) is typical for a for-generate loop in order to synthesize iterative structures.
The following example illustrates an element-wise copy of array b to c without the use of a local array resource. Each interface array consists of 10 elements. It should be assumed that the physical content of both arrays lies in distributed LUT RAM, from which custom connections can be implemented.
Fig. fsmd-arrif-nac illustrates the corresponding function func1. The VHDL interface of func1 is shown in Fig. fsmd-arrif-vhdl where the the derived array types b_type and c_type are used for b, c, respectively. The definitions of these types can be easily devised as aliases to a basic type denoted as: type cdt_type is array (9 downto 0) of std_logic_vector(31 downto 0);. Then, the alias for b is: alias b_type is cdt_type;
Array-to-array copy without intermediate storage (NAC).
procedure func1 (in s32 b[10], out s32 c[10]) { localvar s32 i, t; S_1: i <= ldc 0; S_2 <= jmpun; S_2: S_3, S_EXIT <= jmplt i, 10; S_3: t <= load b, i; c <= store t, i; i <= add i, 1; S_2 <= jmpun; S_EXIT: nop; }
Array-to-array copy without intermediate storage (VHDL interface).
entity func1 is port ( clk : in std_logic; reset : in std_logic; start : in std_logic; b : in b_type; c : out c_type; done : out std_logic; ready : out std_logic ); end func1;
The FSMDs are organized as computations allocated into n+2 states, where n is the number of required control steps as derived by an operation scheduler. The two overhead states are the entry (S_ENTRY) and the exit (S_EXIT) states which correspond to the source and sink nodes of the CDFG of the given procedure, respectively.
Fig. fsmd-minimal shows the absolute minimal example of a compliant FSMD written in VHDL. The FSMD is described in a two-process style using one process for the current state logic and another process for a combined description of the next state and output logic. This code will serve as a running example for better explaining the basic concepts of the FSMD paradigm.
The example of Fig. fsmd-minimal-vhdl implements the computation of assigning a constant value to the output port of the FSMD: outp <= ldc 42;. Thus, lines 5--14 declare the interface (entity) for the hardware block, assuming that outp is a 16-bit quantity. The FSMD requires three states. In line 17, a state type enumeration is defined consisting of types S_ENTRY, S_EXIT and S_1. Line 18 defines the signal 2-tuple for maintaining the state register, while in lines 19--20 the output register is defined. The current state logic (lines 25--34) performs asynchonous reset to all storage resources and assigns new contents to both the state and output registers. Next state and output logic (lines 37--57) decode current_state in order to determine the necessary actions for the computational states of the FSMD. State S_ENTRY is the idle state of the FSMD. When the FSMD is driven to this state, it is assumed ready to accept new input, thus the corresponding status output is raised. When a start prompt is given externally, the FSMD is activated and in the next cycle, state S_1 is reached. In S_1 the action of assigning CNST_42 to outp is performed. Finally, when state S_EXIT is reached, the FSMD declares the end of all computations via done and returns to its idle state.
It should be noted that this design approach is a rather conservative one. One possible optimization that can occur in certain cases is the merging of computational states that immediately prediate the sink state (S_EXIT) with it.
Fig. fsmd-minimal-timediag shows the timing diagram for the minimal design. As expected, the overall latency for computing a sample is three machine cycles.
Minimal FSMD implementation.
Minimal FSMD implementation in VHDL.
library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity minimal is port ( clk : in std_logic; reset : in std_logic; start : in std_logic; outp : out std_logic_vector(15 downto 0); done : out std_logic; ready : out std_logic ); end minimal; architecture fsmd of minimal is type state_type is (S_ENTRY, S_EXIT, S_1); signal current_state, next_state: state_type; signal outp_next: std_logic_vector(15 downto 0); signal outp_reg: std_logic_vector(15 downto 0); constant CNST_42: std_logic_vector(15 downto 0) := "0000000000101010"; begin -- current state logic process (clk, reset) begin if (reset = '1') then current_state <= S_ENTRY; outp_reg <= (others => '0'); elsif (clk = '1' and clk'EVENT) then current_state <= next_state; outp_reg <= outp_next; end if; end process; -- next state and output logic process (current_state, start, outp_reg) begin done <= '0'; ready <= '0'; outp_next <= outp_reg; case current_state is when S_ENTRY => ready <= '1'; if (start = '1') then next_state <= S_1; else next_state <= S_ENTRY; end if; when S_1 => outp_next <= CNST_42; next_state <= S_EXIT; when S_EXIT => done <= '1'; next_state <= S_ENTRY; end case; end process; outp <= outp_reg; end fsmd;
In certain cases, input registering might be desired. This intent can be made explicit by copying input port data to an internal register. For the case of the eda algorithm, a new localvar, a would be introduced to perform the copy as a <= mov in1;. The VHDL counterpart is given as a_1_next <= in1;, making this data available through register a_1_reg in the following cycle. For register r, signal r_next represents the value that is available at the register input, and r_reg the stored data in the register.
Array objects can be synthesized to block RAMs in contemporary FPGAs. These embedded memories support fully synchronous read and write operations. A requirement for asynchronous read mandates the use of memory residing in distributed LUT storage.
In BASIL, the load and store primitives are used for describing read and write memory access. We will assume a RAM memory model with write enable, and separate data input (din) and output (dout) sharing a common address port (rwaddr). To control access to such block, a set of four non-trivial signals is needed: mem_we, a write enable signal, and the corresponding signals for addressing, data input and output.
store is the simpler operation of the two. It requires raising mem_we in a given single-cycle state so that data are stored in memory and made available in the subsequent state/machine cycle.
Synchronous load requires the introduction of a waitstate register. This register assists in devising a dual-cycle state for performing the load. Fig. fsmd-loadstore-vhdl illustrates the implementation of a load operation. During the first cycle of STATE_1 the memory block is addressed. In the second cycle, the requested data are made available through mem_dout and are assigned to register mysignal. This data can be read from mysignal_reg during STATE_2.
Wait-state-based communication for loading data from a block RAM.
when STATE_1 => mem_addr <= index; waitstate_next <= not (waitstate_reg); if (waitstate_reg = '1') then mysignal_next <= mem_dout; next_state <= STATE_2; else next_state <= STATE_1; end if; when STATE_2 => ...
Our extended FSMD concept allows for hierarchical FSMDs defining entire systems with calling and callee CDFGs. A two-state protocol can be used to describe a proper communication between such FSMDs. The first state is considered as the preparation state for the communication, while the latter state actually comprises an evaluation superstate where the entire computation applied by the callee FSMD is effectively hidden.
The calling FSMD performs computations where new values are assigned to *_next signals and registered values are read from *_reg signals. To avoid the problem of multiple signal drivers, callee procedure instances produce *_eval data outputs that can then be connected to register inputs by hardwiring to the *_next signal.
Fig. fsmd-pcall-vhdl illustrates a procedure call to an integer square root evaluation procedure. This procedure uses one input and one output std_logic_vector operands, both considered to represent integer values. Thus, a procedure call of the form (m) <= isqrt(x); is implemented by the given code segment.
State-superstate-based communication of a caller and callee procedure instance in VHDL.
when STATE_1 => isqrt_start <= '1'; next_state <= SUPERSTATE_2; when SUPERSTATE_2 => if ((isqrt_ready = '1') and (isqrt_start = '0')) then m_next <= m_eval; next_state <= STATE_3; else next_state <= SUPERSTATE_2; end if; when STATE_3 => ... isqrt_0 : entity WORK.isqrt(fsmd) port map ( clk, reset, isqrt_start, x_reg, m_eval, isqrt_done, isqrt_ready );
STATE_1 sets up the callee instance. The following state is a superstate where control is transferred to the component instance of the callee. When the callee instance terminates its computation, the ready signal is raised. Since the start signal of the callee is kept low, the generated output data can be transferred to the m register via its m_next input port. Control then is handed over to state STATE_3.
The callee instance follows the established FSMD interface, reading x_reg data and producing an exact integer square root in m_eval. Multiple copies of a given callee are supported by versioning of the component instances.
ANSI C is the archetypical example of a general-purpose imperative language that does not support streaming primitives, i.e. it is not possible for someone to express and process streams solely based on the semantics of such language.
Streaming suits applications with absence of control flow. In a prime factorization algorithm (pfactor), a streaming output can be used, outp, to produce successive factors. The streaming port is accessed based on valid. Thus, outp is accessed periodically in context of basic block BB4 as shown in fsmd-pfactor-nac.
NAC code for a prime factorization algorithm involving output streaming.
procedure pfactor (in u16 x, out u16 outp) { localvar u16 i, n, t0; BB1: n <= mov x; i <= ldc 2; BB2 <= jmpun; BB2: BB3, BB_EXIT <= jmple i, n; BB3: t0 <= rem n, i; BB4, BB5 <= jmpeq t0, 0; BB4: n <= div n, i; outp <= mov i; BB3 <= jmpun; BB5: i <= add i, 1; BB2 <= jmpun; BB_EXIT: nop; }
Operation chaining assigns dependent SSA operations to a single control step. Simple means for selective operation chaining involve merging successive ASAP states. In successive states, intermediate registers are eliminated by wiring assignments to *_next signals and reusing them in the subsequent chained computation, instead of reading from the stored *_reg value. To avoid excessive critical paths, a heuristic is defined for disallowing flow-dependent multiple occurrences of expensive operators in the same newly defined state.
In Fig. fsmd-eda-chaining states S_1_3 to S_1_5 comprise intermediate computations in a merged S_1_1 state.
2D euclidean distance approximation algorithm (eda) without chained computations.
... when S_1_3 => t3_next <= "000"&x_reg(15 downto 3); t4_next <= "0"&y_reg(15 downto 1); next_state <= S_1_4; when S_1_4 => t5_next <= x_reg - t3_reg; next_state <= S_1_5; when S_1_5 => t6_next <= t4_reg + t5_reg; next_state <= S_1_6;
2D euclidean distance approximation algorithm (eda) with chained computations.
when S_1_1 => ... t3_next <= "000"&x_next(15 downto 3); t4_next <= "0"&y_next(15 downto 1); t5_next <= x_next - t3_next; t6_next <= t4_next + t5_next; ...
The HercuLeS 1.0 (2013a) distribution includes a graphical user interface (GUI) for allowing user-friendly access to HercuLeS HLS without the burden of coping with command-line syntax. The main purpose of the GUI is for the user to control code generation, simulation and synthesis options via an intuitive scheme. The user sets various options for the overall process from within the GUI (by interacting with checkbuttons, radiobuttons, entries, text widgets etc) in order for a shell script to be generated which will steer these tasks transparently. For running the generated script, a minimal Unix bash script environment is expected. On Windows, the MinGW and msys distributions are suggested. On Linux, the required facilities are natively supported in almost any distributions, including Ubuntu Linux 12.04 LTS.
To summarize, the HercuLeS GUI performs the following tasks:
The HercuLeS GUI can be accessed by double-clicking on the icon of the hercules.exe executable, or by command-line invocation as follows:
from within the top-level directory of your HercuLeS installation.
The HercuLeS GUI executable is available on both 32-bit Windows and 32-bit Linux.
When executing hercules.exe, a splashscreen appears for a few seconds, as shown in Fig. hercules-gui-splashscreen.
After the lapse of a few seconds, the basic configuration screen of HercuLeS is visible. A nominal view of the GUI is shown in Fig. hercules-gui-basicscreen. The GUI consists of the following:
A major element of the HercuLeS GUI which is not readily visible is the Results browser. This element is activated after a successful run of generating and executing a HercuLeS run script for a specific C or NAC translation unit.
It should be noted that context-specific balloon help is available for most visible controls. This kind of help of accessible simply by mouse hovering over the corresponding GUI element.
The fastest and simplest way to use the HercuLeS GUI is a four-step process. Using this process, the user is able to perform C simulation, VHDL simulation and logic synthesis on the generated C and VHDL representation of a specified C or NAC program file.
The process is as follows:
Fig. hercules-gui-quickstart depicts graphically the proposed four-step process for quickly setting up and processing a program file with HercuLeS.
A set of basic controls are always visible as part of the basic screen layout of the HercuLeS GUI. Fig. hercules-gui-framedcontrols illustrates all the available basic framed controls.
The Simulator control provides a choice between different HDL simulators. Currently, only VHDL simulation is needed since only VHDL RTL code is generated as a result of the high-level synthesis process. The available choices for simulators are GHDL (for the GHDL simulator) and Modelsim (for Mentor Modelsim). It is expected that both of these simulators (or at least the one that is intended for use) is already installed on the host system and its executables directory is declared within the PATH environmental variable.
The Output waveform format control allows to choose between three choices for generating or not waveform data from the HDL simulation:
VCD and GHW are both supported by recent versions of the GTKwave waveform viewer.
The Simulation and synthesis options control is a set of checkbuttons to enable or disable the generation of corresponding entries in the HercuLeS run script for the following:
The Miscellaneous options control groups all remaining controls. Currently only Generate test data is available. When enabled, this checkbutton enables using the host C compiler for generating reference test input/output data for the application under processing.
In order to control in detail the available high-level synthesis settings, a notebook (currently consisting of four tabs) is always accessible from the basic screen layout of the HercuLeS GUI.
These tabs organize General, Optimizations, Operation scheduling and Code generation controls into corresponding categories.
From left to right, the first notebook tab is General which covers general file emission options. It is shown in Fig. hercules-gui-nb-general and consists of three option groups.
The first option group, Generate intermediate NAC files, can be used to enable intermediate dumps after various stages of the HLS compilation process. Currently on printing intermediate NAC files for each procedure following register allocation can be specified.
The second option group, Graphical information, can be used for enabling and disabling of the emission of graphical information at the CDFG (Control-Data Flow Graph), CFG (Control-Flow Graph) and Call graph level. A CDFG illustates control and data dependencies between the statements in a NAC procedure. A CFG illustrates only the control dependencies at basic block granularity in a procedure. A call graph depicts the call graph structure of the entire translation unit.
The third option group, Floating-point arithmetic, provides control settings for the support of floating-point arithmetic (Floating-point arithmetic support). Another checkbutton, Floating-point builtin algorithmic functions enables the rewriting of NAC programs so that transcendental standard C library functions (such as calls to sin() and atan()) to be replaced by non-synthesizable implementations provided by proposed extensions to the VHDL-2008 floating-point arithmetic package.
The second notebook tab is Optimizations which covers optimization-specific options. All depicted optimizations regard external optimizers that are bundled within the HercuLeS distributions for transforming C, NAC, or Graphviz representations. The notebook view is shown in Fig. hercules-gui-nb-opt and consists of three option groups.
This tab basically groups a set of checkbuttons for enabling or disabling specific optimizations. The first checkbutton, Enable optimizations, enables or disables the entire group of the following checkbuttons. The supported optimizations are as follows:
The third notebook tab is Operation scheduling which covers operation scheduling and memory synthesis options. Overall, this tab is dedicated for the setting of options that control aspects of the generated HDL architectures. A typical tab view is shown in Fig. hercules-gui-nb-arch and consists of two option groups.
The first option group, Operation scheduling method, can be used for selecting one of the provided schedulers for the task of operation scheduling. Currently Sequential and ASAP scheduling are supported. Sequential schedules one operation per FSMD state. ASAP is a form of unconstrained scheduling and allows for mutually-independent operations to be bundled within the same FSMD control step (state).
The Operation chaining checkbutton enables a heuristic that allows to collapse multiple dependent operations within the same FSMD control step. In some cases, this technique leads to overcontention of specific FSMD states and subsequently to lower performance (e.g. reduced clock period due to larger combinational path). Basic block partitioning contributes a heuristic so that existing basic blocks are split into smaller ones in order to alleviate for this problem.
The second option group is named Memory synthesis options and is used for controlling the generation of RAM description that support automatic block RAM inference. This mandates the use of synchronous read descriptions. Xilinx block RAM support different read schemes. The corresponding combobox, Block RAM generation enables the user to select among two different schemes, Read-first and Read-through. Both schemes are explained in the Xilinx XAPP463 application note on block RAM usage.
The fourth and last notebook tab is Code generation which covers options that affect code generation in HercuLeS. A view of this tab is shown in Fig. hercules-gui-nb-cgen and consists of three option groups regarding backend code generation, SSA (Static Single Assignment) and register allocation.
The first option group, NAC, C and VHDL code generation, provides checkbuttons for controlling the emission of NAC, backend ANSI C and RTL VHDL code. NAC and ANSI C backend code are generated by the nac2cdfg tool, while RTL VHDL is generated by the graph-based cdfg2hdl backend tool of HercuLeS.
This is followed by SSA construction options, a control set for fine-tuning the generated intermediate representation form by nac2cdfg. Generating SSA form is mandatory for non-sequential operation scheduling like the ASAP scheme. Apart from the Enable SSA construction checkbutton, a combobox permits to choose among two different methods for IR construction, classic minimal SSA which can be generated by using the Aycock-Horspool technique and Pseudo SSA which is a form of intrablock variable numbering. The latter is much faster than the former, however not a true SSA scheme, since the single definition point property of SSA form is not sustained at interblock scope. Another checkbutton, Hardware implementation of PHI functions allows for a direct mapping of SSA form (which involves so-called PHI functions which are join points for variable definitions from different control paths) to hardware. The default choice is to first convert SSA form out-of-SSA. In this case, PHI functions are removed and move operations must be placed to construct all necessary variable copies.
Register allocation options is a set of controls for setting whether register allocation is to be applied. When register allocation is disabled, simply its temporary variable is translated to a hardware register. To perform register allocation the Enable register allocation checkbutton must be selected. Currently only linear-scan register allocation (Linear scan) can be selected from the corresponding register allocation method combobox.
A set of five action buttons, Run HercuLeS, Results browser, Clear generated files, Clear console (on the left), and Exit (on the right) is visible near the bottom of the basic screen layout. Fig. hercules-gui-basicscreen-actions illustrates the corresponding controls.
The actions performed by Run HercuLeS and Results browser have been already described. Clear generated files deletes all generated result files from the working directory of the loaded C or NAC program file. Clear console deletes all information that has been emitted in the read-only console. Exit forces the HercuLeS GUI to close. It should be noted that Results browser, Clear generated files and Clear console are only enabled following the execution of a HercuLeS script.
The HercuLeS GUI comes with a results browser which is invoked after the execution of a HercuLeS run script has completed. Fig. hercules-gui-resbrowser illustrates an example view of the results browser layout.
On the left side, the results browser uses a tree viewing GUI element. This tree viewer loads a file named hlsrestree.tcl from the application's current (working) directory. This file provides an automatically-generated GRIDPLUS tree widget to enable browsing of generated files from the high-level synthesis process.
An example hlsrestree.tcl automatically generated for a Fibonacci series (fibo) computation application is shown below:
gpset .treebrowser.hlstree { {/PRGM + "Program: fibo" :folder_blue-22} {/PRGM/RPTS + "Reports" :folder_graph_white-22} {/PRGM/RPTS/RESANSIC "ANSI C simulation results" :txt-22} {/PRGM/RPTS/RESVHDL "VHDL simulation results" :txt-22} {/PRGM/RPTS/RESXST "XST/ISE synthesis report" :txt-22} {/PRGM/PROC0 + "Procedure: fibo" :folder_grey-22} {/PRGM/PROC0/CDFG + "CDFG: fibo" :folder_red-22} {/PRGM/PROC0/CDFG/DOT "Graphviz dot" :dot-22} {/PRGM/PROC0/CDFG/PNG "PNG image" :png-22} {/PRGM/PROC0/CFG + "CFG: fibo" :folder_orange-22} {/PRGM/PROC0/CFG/DOT "Graphviz dot" :dot-22} {/PRGM/PROC0/CFG/PNG "PNG image" :png-22} {/PRGM/PROC0/NAC + "NAC representation" :folder_green-22} {/PRGM/PROC0/NAC/FINAL "Optimized NAC : fibo.nac" :nac-22} {/PRGM/PROC0/NAC/POSTFE "NAC: fibo_ssa.nac" :nac-22} {/PRGM/PROC0/ANSIC + "ANSI C representation" :folder_olive-22} {/PRGM/PROC0/ANSIC/FINAL "C backend code for fibo: fibo_nac.c" :ansic-22} {/PRGM/PROC0/VHDL + "Generated VHDL files for fibo" :folder_magenta-22} {/PRGM/PROC0/VHDL/RTL "RTL design for fibo" :vhdl-22} {/PRGM/PROC0/VHDL/CDTPKG "Compound data structures package for fibo" :vhdl-22} {/PRGM/CG + "Call graph: fibo" :folder_yellow-22} {/PRGM/CG/DOT "Graphviz dot" :dot-22} {/PRGM/CG/PNG "PNG image" :png-22} {/PRGM/GLOBAL + "Global miscellaneous files for fibo" :folder_white-22} {/PRGM/GLOBAL/HERCSH "Generated HercuLeS bash script for fibo" :bash-22} {/PRGM/GLOBAL/PROCNMS "Procedure names in fibo" :txt-22} {/PRGM/GLOBAL/BUILTINS "Black box procedures in fibo" :txt-22} {/PRGM/GLOBAL/TESTDATA "Reference I/O data for fibo" :txt-22} {/PRGM/GLOBAL/MAINC "C driver code for fibo: main.c" :ansic-22} {/PRGM/GLOBAL/MAINH "C header code for fibo: main.h" :ansic_header-22} {/PRGM/GLOBAL/MK "Makefile for builting C backend code: ansic.mk" :makefile-22} {/PRGM/GLOBAL/RSIMDO ".do script for VHDL simulation using Modelsim" :txt-22} {/PRGM/GLOBAL/RSIMSH "Bash script for running the VHDL simulation" :bash-22} {/PRGM/GLOBAL/XSTSH "Bash script for running logic synthesis with Xilinx XST/ISE" :bash-22} {/PRGM/VHDL + "Global VHDL files for fibo" :folder_cyan-22} {/PRGM/VHDL/PKG "Global package for fibo" :vhdl-22} {/PRGM/VHDL/TB "Testbench for fibo" :vhdl-22} }
On the right side, the user can view either a textual or a graphical representation (the latter when appropriate) of the requested information. Currently, the Graphical view and Statistics views are left unused. It should be noted that when requesting a PNG visualization of a graph (e.g. a CDFG, CFG or call graph), an external image viewer is accessed, the name of and the path to which can be defined via means of the initialization file or the external tools configuration layout.
The following table summarizes all automatically generated files that are accessible from the results browser environment. The name of the current application is app.
Path to tree widget | Description |
/PRGM | Top-level folder for app. |
/PRGM/REPORTS | Reports folder. |
/PRGM/REPORTS/DEBUG /PRGM/REPORTS/RESANSIC /PRGM/REPORTS/RESVHDL /PRGM/REPORTS/RESXST |
Debug dump. ANSI C simulation results. VHDL simulation results. XST/ISE synthesis report. |
/PRGM/PROC$i | Top-level folder for procedure proc. |
/PRGM/PROC$i/CDFG | CDFG folder for proc. |
/PRGM/PROC$i/CDFG/DOT /PRGM/PROC$i/CDFG/PNG |
Graphviz dot representation for the CDFG. PNG image visualization of the Graphviz CDFG. |
/PRGM/PROC$i/CFG | CFG folder for proc. |
/PRGM/PROC$i/CFG/DOT /PRGM/PROC$i/CFG/PNG |
Graphviz dot representation for the CFG. PNG image visualization for the CFG. |
/PRGM/PROC$i/NAC | Folder for the NAC representation of proc. |
/PRGM/PROC$i/NAC/FINAL /PRGM/PROC$i/NAC/POSTFE /PRGM/PROC$i/NAC/RA |
Optimized NAC for proc. NAC following frontend processing for proc. NAC following register allocation for proc. |
/PRGM/PROC$i/ANSIC | Folder for the ANSI C representation of proc. |
/PRGM/PROC$i/ANSIC/FINAL | C backend code for proc. |
/PRGM/PROC$i/VHDL | Folder for generated VHDL files of proc. |
/PRGM/PROC$i/VHDL/RTL /PRGM/PROC$i/VHDL/CDTPKG |
RTL design for proc. Compound data structures package for proc. |
/PRGM/CG | Call graph for app. |
/PRGM/CG/DOT /PRGM/CG/PNG |
Graphviz dot representation for the call graph. PNG image visualization for the call graph. |
/PRGM/GLOBAL | Global miscellaneous files for app. |
/PRGM/GLOBAL/HERCSH /PRGM/GLOBAL/PROCNMS /PRGM/GLOBAL/BUILTINS /PRGM/GLOBAL/TESTDATA /PRGM/GLOBAL/MAINC /PRGM/GLOBAL/MAINH /PRGM/GLOBAL/MK /PRGM/GLOBAL/RSIMMK /PRGM/GLOBAL/RSIMDO /PRGM/GLOBAL/RSIMSH /PRGM/GLOBAL/XSTSH |
Generated HercuLeS bash script for app. Procedure names in app. Black box procedures in app. Reference I/O data for app. C driver code for app. C header code for app. Makefile for builting C backend code. Makefile for VHDL simulation using GHDL. .do script for VHDL simulation using Modelsim. Bash script for running the VHDL simulation. Bash script for running logic synthesis. |
/PRGM/VHDL | Folder for global VHDL files of app. |
/PRGM/VHDL/PKG /PRGM/VHDL/TB |
Global package for app. Testbench for app. |
HercuLeS supports user-defined configuration files for fast loading of high-level synthesis settings. A configuration file is expected to have the .config suffix. The supplied options are organized into four distinct categories, common (options that are common across all tools), nac2cdfg (passed to the HercuLeS nac2cdfg tool), cdfg2hdl (passed to the HercuLeS cdfg2hdl tool) and more where all miscellaneous options are defined. For the first three categories, these options represent command-line switches. The last category, more, defines options that are intentionally similar to command-line switches but are however passed to the HercuLeS GUI directly for controlling HercuLeS run script generation.
The general structure of a configuration file is shown below:
start-common <options> end-common start-nac2cdfg <options> end-nac2cdfg start-cdfg2hdl <options> end-cdfg2hdl start-more <options> end-more
An example of a typical configuration file invoking C backend file simulation, VHDL simulation and XST/ISE synthesis is shown below:
start-common end-common start-nac2cdfg -force-data-types -ssa -pseudo-ssa -emit-nac -emit-ansic -emit-cfg -emit-cg end-nac2cdfg start-cdfg2hdl -sched-asap -ieee -vhd2vl -mpint -mti end-cdfg2hdl start-more -optimizations -nacsopt -nacpeep -chain -rsim -csim -datagen -synth -emit-cdfg end-more
The list of command-line switches that are passed to the HercuLeS GUI is as follows:
During startup, the HercuLeS GUI automatically loads a predefined initialization file, named hercules.ini. This .ini file assigns all required environmental and internal variables for the proper setup of HercuLeS.
The initialization file is an ASCII text file comprising of a set of entries of the following form:
where variable is the name of the environmental or internal use variable to be set and rhs-string is the string value that is assigned to it.
The following table provides a brief summary of the variables that can be defined in initialization files.
Name | Description |
hlstop | Top-level installation directory for HercuLeS. |
compiler_name | Name of the host C compiler. |
compiler_path | Path to the executables' directory of the host C compiler. |
compiler_opts | Command-line options to pass to the host C compiler for generating GIMPLE dumps (only for gcc and gimple2nac). |
cfe_name | Name of the C frontend for compiling to NAC. |
cfe_path | Path to the C frontend executable. |
srcopt_path | Path to the C-to-C optimizer. |
xilinx_ise_path | Path to the Xilinx ISE directory. |
fpga_arch | FPGA architecture to be used for logic synthesis. |
fpga_part | FPGA device to be used for logic synthesis. |
graphviz_path | Path to the Graphviz installation. |
dotproc_name | Name of the Graphviz dot processor (default: dot). |
dotviewer_name | Name of the Graphviz file viewer (default: dotty). |
imgviewer_name | Name of the image viewer executable. |
imgviewer_path | Path to the image viewer executables' directory. |
pdfviewer_name | Name of the PDF viewer executable. |
pdfviewer_path | Path to the PDF viewer executables' directory. |
htmlviewer_name | Name of the HTML viewer executable. |
htmlviewer_path | Path to the HTML viewer executables' directory. |
An example initialization file is shown below:
hlstop="g:/hercules" compiler_name="gcc" compiler_path="/usr/local/bin/gcc-4.7.0-install/bin" compiler_opts="-Wall -O2 -fdump-tree-gimple-raw" cfe_name="gimple2nac" cfe_path="g:/hercules/gimple2nac" srcopt_path="g:/hercules/txlcopt" xilinx_ise_path="c:/Xilinx/12.3/ISE_DS/ISE" fpga_arch="virtex6" fpga_part="xc6vlx75t-ff484-1" graphviz_path="c:/CompSci/Graphviz2.28" imgviewer_path="g:/hercules/thirdparty/Imagine" imgviewer_name="Imagine" dotproc_name="dot" dotviewer_name="dotty" pdfviewer_path="c:/Program\ Files/Adobe/Reader\ 10.0/Reader" pdfviewer_name="AcroRd32" htmlviewer_path="c:/Documents\ and\ Settings/nkavvadias/Local\ Settings/Application\ Data/Google/Chrome/Application" htmlviewer_name="chrome"