Reconfigurable on-chip SIMD processor architectures for intelligent CMOS camera chips

Dietmar Fey, Lutz Hoppe, and Andreas Loos
Friedrich-Schiller-University Jena, Institute of Computer Science
Ernst-Abbe-Platz 2, 07743 Jena, Germany
{fey, hoppe, loos}@informatik.uni-jena.de

Abstract

We present the results of an investigation into the suitability of different reconfigurable parallel SIMD architectures for integration in a one-chip high-speed smart CMOS camera. The processing elements (PEs) of the architecture combine parallel analogue optical signal detection and parallel digital signal processing to meet real-time requirements. However, the parallel architecture puts some constraints on the PE architecture. To achieve reasonable pixel resolutions and fill factors, the PE area has to be as small as possible. At the same time a single PE must offer sufficient functional flexibility. We show by logic synthesis that reconfigurable architectures based on morphological operations are the best solution to fulfil these constraints. Furthermore, we present simulation results of a first test chip, which we designed as an OPTO-ASIC with a simple SIMD chip architecture.

1. Introduction

Owing to advances in integrated circuit technology, CMOS (complementary metal-oxide semiconductor) sensor chips have become the dominant standard for intelligent or smart sensor camera systems [1] in the last decade. In this context, smart means that image capturing and image processing are carried out on one chip. Future industrial automation systems will be characterized by a co-operation of humans and robots. To meet safety requirements in such systems, reply times below 10 ms are required for the CMOS cameras deployed in the gripper arm of a robot.

Traditional serial architectures in current intelligent optical sensor chips will not be sufficient to achieve such high data rates. For example, to process an image of 640×480 pixels within the 10 ms reply time, only about 32 ns can be spent on each pixel. Even with a 1 GHz microprocessor, whose use in a gripper arm is hardly conceivable because of its high power dissipation, only 32 elementary processor steps would remain for a single pixel, which is not sufficient. This holds all the more for demanding tasks such as tracking analysis or motion detection.

Serial sensor architectures consist of a CMOS photo detector matrix for signal detection, an analogue-to-digital converter (ADC) for digitizing, and a digital signal processor (DSP) for data processing (see Figure 1). The main bottleneck of this architecture is the parallel-to-serial converter between the ADC and the DSP, in particular when processing high resolutions. To reach fast processing rates, parallel on-chip architectures are inevitable for high-speed cameras, as has also been shown in other work [3]-[7].

Figure 1. Serial architecture in smart CMOS cameras. (Blocks shown: incoming light, CMOS photo detectors, ADC, DSP.)

As a solution we favour a parallel architecture in which parallel signal detection and parallel signal processing with application-specific hard-wired or programmable parallel algorithms are integrated in one chip (see Figure 2). Since many image processing algorithms are data-parallel, a SIMD (single instruction, multiple data) architecture is most suitable for the parallel optoelectronic on-chip processor. Each PE in such a parallel architecture consists of a photo detector, its own ADC and the corresponding logic. For communication among the PEs we connect them with a NEWS network, i.e. each PE has a direct connection to its four orthogonal neighbor PEs (north, east, west, south). Furthermore, the PEs and the photo detectors are equally spaced. Since each pixel in such an architecture has its own simple intelligence, this approach is sometimes also called a smart pixel architecture.
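The per-pixel time budget quoted in the introduction follows directly from the frame size and the required reply time. The short Python sketch below reproduces that arithmetic. The function names are illustrative, and the assumption of one elementary processor step per clock cycle is ours, not a property of any particular processor.

```python
def per_pixel_budget_ns(width: int, height: int, reply_time_s: float) -> float:
    """Time available per pixel (in ns) when all pixels are processed serially."""
    return reply_time_s / (width * height) * 1e9


def steps_per_pixel(budget_ns: float, clock_hz: float) -> int:
    """Whole clock cycles that fit into the per-pixel budget.

    Assumption: one elementary processor step per clock cycle.
    """
    return int(budget_ns * 1e-9 * clock_hz)


# Values from the text: 640x480 pixels, 10 ms reply time, 1 GHz clock.
budget = per_pixel_budget_ns(640, 480, 10e-3)
print(f"{budget:.2f} ns per pixel")
print(steps_per_pixel(budget, 1e9), "steps per pixel at 1 GHz")
```

With the values from the text this yields a budget of roughly 32 ns, i.e. 32 clock cycles at 1 GHz, which matches the figures given in the introduction.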
Figure 2. Smart pixel architecture for CMOS cameras. (Each PE shown: photo detector, ADC, logic.)

In spite of the benefits such a parallel smart pixel architecture offers, such as fast processing and the direct mapping of parallel algorithms onto the processor array, there are some drawbacks concerning the technical realization. Since each pixel PE has its own ADC, we need more chip area than other parallel architectures, which read out the image column by column and process the columns with a linear SIMD array. This additional chip area leads to lower fill factors and lower pixel resolutions. To compensate for these drawbacks we have to design the PEs as small as possible. However, a small PE area limits the functional capabilities. As we show, it is nevertheless possible to realize efficient solutions by implementing the PE architecture with a reconfigurable approach [8].

The remainder of the paper is organized as follows. In Section 2 we present a smart pixel architecture for a CMOS camera chip based on reconfigurable PEs. We evaluate this architecture by comparing it with a traditional programmable architecture of identical functionality. Section 3 shows implementation results of parallel sensor chips in which both a smart pixel approach and a solution based on a linear SIMD array were realized. Finally, we conclude the paper with a summary of the most important results.

2. Architecture of the processor array

As a first step towards a fast parallel sensor chip, and to quantify the gain of a reconfigurable architecture, we designed a smart pixel array in two variants: a reconfigurable architecture and a merely programmable one. In fact both architectures are programmable, but the programmability of the reconfigurable architecture is based on the configuration of data paths in a combinatorial circuit. The programmable architecture realizes a traditional von Neumann approach with registers, an ALU and an execution loop consisting of instruction and operand fetch, instruction execution and storing of results. Furthermore, we restricted ourselves to binary image processing in this first approach to save PE area. This also simplifies the hardware realization, since for binary image processing the ADC can be reduced to a simple comparator circuit.

The processor array was designed using a top-down approach, i.e. we started our design by investigating appropriate algorithms for the parallel calculation of edge detection, contour code, and the basic morphological operations of erosion and dilatation [9]. In the following we refer to all these algorithms as macro operations. These macro operations are based on so-called local operators, which is the basis for parallel calculation. A local operator is one whose result can be computed in every pixel simultaneously by accessing only the values of the directly neighboring pixels, e.g. in a 3×3 neighborhood. To carry out the macro operations in parallel hardware, they have to be reduced to a sequence of simpler micro operations. We decided to use the morphological operations of erosion and dilatation within a four or eight pixel neighborhood as the basis for our hardware micro operations.

For a binary image, the erosion of an image point A(x, y) on a four pixel neighborhood, denoted $f_{erosion}^{(4)}$, is defined as the logical AND of the current pixel value A(x, y) and the pixel values of its direct neighbors in the orthogonal directions (1). For an eight pixel neighborhood, the values of the four diagonal neighbors to the north-west, south-west, north-east and south-east are considered as well.

\[ f_{erosion}^{(4)} = A(x, y) \wedge A(x-1, y) \wedge A(x+1, y) \wedge A(x, y-1) \wedge A(x, y+1) \tag{1} \]

Erosion means that a black pixel, corresponding to the Boolean value true, is erased if and only if at least one pixel within its neighborhood is white, i.e. has the logical value false. Dilatation means the opposite, i.e. a white pixel turns black if at least one of its neighboring pixels is black. Consequently, to express dilatation by a Boolean formula, all logical AND operators in (1) have to be replaced by logical OR operators.

Erosion and dilatation can be used for edge detection. Erosion-based edge detection requires a pixel-by-pixel AND of the original image img with the inverted eroded image (2); the dilatation-based variant is formed analogously from the dilated image.

\[ \mathrm{Edge}(img) = img \wedge \neg\,\mathrm{erode}(img) \tag{2} \]

The difference between the two is that erosion-based edge detection finds the inner edge of an object, whereas dilatation-based edge detection finds the outer edge. In both cases noise, e.g. caused by reflections, can occur. It can be eliminated with the help of the so-called open and close functions (3). The open function is defined as an erosion followed by a dilatation, whereas the close function is defined as a dilatation followed by an erosion. Open eliminates isolated points in an image, e.g. those caused by noise. Close can be used to fill gaps in the contour of an object.

\[ \mathrm{Open}(img) = \mathrm{dilate}(\mathrm{erode}(img)), \qquad \mathrm{Close}(img) = \mathrm{erode}(\mathrm{dilate}(img)) \tag{3} \]

Figure 3 shows a corresponding example for an eight pixel neighborhood. An original gray-scale image of three coins is binarized with a threshold value. In our sensor chip this is performed in the analogue part, which consists of a comparator circuit attached to a photo detector.

Figure 3. Edge detection based on erosion (left), dilatation (middle) and close function (right)

Dilatation can be reduced to an erosion and two inverting operations.
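The operations defined in (1) to (3) can be illustrated with a small software model. The following Python sketch is only a sequential illustration of the local operators, not the parallel hardware. It assumes that neighbors outside the image are ignored, which the text does not specify, and all function names are chosen for this example. Dilatation is deliberately written as an erosion framed by two inversions, as stated in the text.

```python
from typing import List

Image = List[List[bool]]

# Offsets (dx, dy) of the four orthogonal and the four diagonal neighbors.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (1, -1), (-1, 1), (1, 1)]


def erode(img: Image, eight: bool = False) -> Image:
    """Equation (1): AND of each pixel with its 4 (or 8) neighbors.

    Assumption: neighbors outside the image are ignored.
    """
    h, w = len(img), len(img[0])
    offsets = N8 if eight else N4
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            value = img[y][x]
            for dx, dy in offsets:
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:
                    value = value and img[ny][nx]
            row.append(value)
        out.append(row)
    return out


def invert(img: Image) -> Image:
    """Pixel-wise NOT."""
    return [[not p for p in row] for row in img]


def dilate(img: Image, eight: bool = False) -> Image:
    """Dilatation expressed as one erosion and two inversions."""
    return invert(erode(invert(img), eight))


def edge(img: Image, eight: bool = False) -> Image:
    """Equation (2): img AND NOT erode(img), i.e. the inner edge."""
    eroded = erode(img, eight)
    return [[p and not e for p, e in zip(r, er)] for r, er in zip(img, eroded)]


def open_(img: Image, eight: bool = False) -> Image:
    """Equation (3): erosion followed by dilatation."""
    return dilate(erode(img, eight), eight)


def close(img: Image, eight: bool = False) -> Image:
    """Equation (3): dilatation followed by erosion."""
    return erode(dilate(img, eight), eight)


# A 3x3 black square inside a 5x5 image, and a single isolated black pixel.
block = [[1 <= x <= 3 and 1 <= y <= 3 for x in range(5)] for y in range(5)]
dot = [[x == 2 and y == 2 for x in range(5)] for y in range(5)]
for row in edge(block):
    print("".join("#" if p else "." for p in row))
```

In this example the edge of the 3×3 square is the ring of its eight border pixels, and applying open_ to the isolated pixel removes it, which corresponds to the noise suppression described in the text.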
In the case of binary image processing an erosion operation is nothing more than an AND operation. Hence, it is sufficient to implement a NOT and an AND operation in each PE of our processor array. Figure 4 shows the corresponding data flow graph, which is implemented in each PE of our SIMD architecture. The node pixel in Figure 4 corresponds to the digitized pixel value attached to a PE. The nodes left, right, top, and bottom correspond to the digitized pixel values of the four orthogonal neighbor PEs. These values are transferred over the NEWS network of the SIMD array.

As Figure 4 shows, the data path contains a loop. Erosion, dilatation and edge detection are calculated in one pass, whereas the more complex morphological operations open and close require iterative passes. Via the multiplexer MUX_1, the configuration bit feedback determines whether pixel or the result of the previous pass is sent into the loop. The second configuration bit, inv1, inverts pixel, which is necessary if we want to calculate a dilatation. The subsequent gate AND_1 forms the conjunction of the current pixel value with the left and right neighbor pixel values.

Our hardware is intended to support both the four pixel and the eight pixel neighborhood for the erosion operation. Depending on the neighborhood, either the current pixel value or the AND result of the current, left, and right pixel values has to be sent to the top and bottom neighbor PEs. This is controlled by the third configuration bit, 4_or_8_pix_neighbor, and its corresponding multiplexer. Depending on this selection, which takes place in all PEs, the nodes top and bottom receive either the current pixel values of the upper and lower PEs or the conjunction of the three nearest PE pixel values in the upper and lower rows. The output of gate AND_2 is therefore the erosion over either a four or an eight pixel neighborhood.

With the help of the fourth configuration bit, edge_or_erode, this erosion result can be inverted and combined with the current pixel value in gate AND_3, which corresponds to the detection of an edge according to (2). If we want to determine the erosion rather than the edge, edge_or_erode selects in MUX_3 the output of gate AND_2 that has not been inverted by EXOR_2. At the end of the data path either the detected edge pixel or the erosion or dilatation result is stored in the flip-flop result. If we want to calculate the open or close function, a new iteration can start.

Figure 4. Reconfigurable data path implementing erosion, dilatation, edge detection, open and close in one single PE. (Labels in the figure: inputs Pixel, Left, Right, Top, Bottom; MUX_1, MUX_2, MUX_3 (2-to-1); EXOR_1, EXOR_2; AND_1, AND_2, AND_3; DFF result with CLK; config bits feedback, inv (0 ≡ id; 1 ≡ not), 4 or 8 pixel neighborhood, edge_or_erode, final_not; outputs to left, right, top and bottom neighbors.)

Another way to realize the data flow graph is a programmable approach, in which the data flow graph of Figure 4 is implemented with a simple programmable ALU, a small register file, and an instruction decoder. To compare and evaluate the corresponding area values, we implemented both architectures in SystemC and carried out a logic synthesis for a 0.6 µm standard CMOS process [9]. The result is that the reconfigurable PE (1722 µm²) is less than half the size of the programmable PE (3841 µm²). Hence, the reconfigurable approach would allow the integration of 480×480 PEs on a chip area of 2×2 cm². A change to a more advanced CMOS process would also allow VGA resolution to be supported. The control unit, which generates the configuration bits for the processor array, could be realized externally. Besides the global clock and the power supply, only four further global control signals are necessary to configure the data paths in each PE.
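To make the role of the configuration bits more concrete, the following Python sketch models one pass of the data path for a whole PE array. It is a behavioural reading of the description of Figure 4, not the synthesized circuit, and it rests on several assumptions that the text does not settle: the values exchanged with the left and right neighbors are taken after EXOR_1, the bit final_not (which appears in the figure but is not described in the text) is assumed to invert the value stored in result, and neighbors outside the array are assumed to read as true. The names Config and simd_pass are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[bool]]


@dataclass
class Config:
    feedback: bool = False   # MUX_1: False = pixel, True = stored result
    inv: bool = False        # EXOR_1: invert the value entering the loop
    eight: bool = False      # 4_or_8_pix_neighbor
    edge: bool = False       # edge_or_erode: True selects the edge output
    final_not: bool = False  # assumed: invert the value stored in result


def _get(img: Image, x: int, y: int) -> bool:
    """Neighbor value; positions outside the array read as True (assumption)."""
    if 0 <= y < len(img) and 0 <= x < len(img[0]):
        return img[y][x]
    return True


def simd_pass(pixel: Image, result: Image, cfg: Config) -> Image:
    """One clock pass of all PEs; returns the new content of the result DFFs."""
    h, w = len(pixel), len(pixel[0])
    src = result if cfg.feedback else pixel                     # MUX_1
    a = [[src[y][x] != cfg.inv for x in range(w)]               # EXOR_1
         for y in range(h)]
    row_and = [[a[y][x] and _get(a, x - 1, y) and _get(a, x + 1, y)
                for x in range(w)] for y in range(h)]           # AND_1
    vert = row_and if cfg.eight else a                          # MUX_2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ero = (row_and[y][x] and _get(vert, x, y - 1)
                   and _get(vert, x, y + 1))                    # AND_2
            # EXOR_2 inverts ero, AND_3 forms the edge, MUX_3 selects.
            val = (a[y][x] and not ero) if cfg.edge else ero
            row.append(val != cfg.final_not)
        out.append(row)
    return out


EROSION = Config()
DILATATION = Config(inv=True, final_not=True)
EDGE = Config(edge=True)

blank = [[False] * 5 for _ in range(5)]
dot = [[x == 2 and y == 2 for x in range(5)] for y in range(5)]

# Open = erosion pass, then a dilatation pass fed from the stored result.
first = simd_pass(dot, blank, EROSION)
opened = simd_pass(dot, first, Config(feedback=True, inv=True, final_not=True))
print(sum(map(sum, opened)), "pixels remain after open")
```

Under these assumptions, erosion, dilatation and edge detection each take a single call of simd_pass, while open and close take two calls with feedback set in the second one, mirroring the iterative passes described in the text.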
3. Technological Realization

As a first prototype to test the technology and to demonstrate feasibility in principle, we designed a layout for a parallel sensor chip with an 8×8 array (see Figure 5).

Figure 5. Layout of the parallel smart pixel sensor chip

The analogue part was designed by the Institute of Electrical Measurement and Circuit Design of Vienna University of Technology; the digital part was designed by us [10]. The chip was realized in the XB06P3 BiCMOS technology of the XFAB company in Erfurt, Germany. The data rate of the whole chip was determined to be 400 MBit/s. This result is based on simulation, because the chip had not yet been completely measured at the time of writing.

The algorithms implemented in this architecture allow the following four operations to be calculated on binary images: edge detection, erosion, dilatation and contour code. This architecture does not yet contain the reconfigurable data path of Figure 4. Instead, all PEs in this architecture calculate all four operations simultaneously in parallel data paths. The desired operation is selected by a two-bit global control signal which is routed to all PEs. The PE layout was generated by logic synthesis of a VHDL specification followed by layout synthesis. Minimized combinatorial logic is implemented in each PE to keep the number of gates small. In total, 19 gates were generated for the whole functionality. These 19 gates were placed and routed automatically within an area of 120×120 µm² (see Figure 6). The processor array itself had to be designed by manual placement. All PEs are connected by a NEWS network.

Figure 6. Layout of a single PE of the smart pixel array. Left: the photodiode; right: the digital PE logic. (Labels in the figure: test structure, photo diode, digital logic.)

A simulation of the digital part of the PE at layout level with an extracted SPICE specification resulted in a possible clock frequency of 526 MHz. To test the functionality of the whole chip, we started the simulation with a photo current as input and ended by verifying the output at an external pad. The chip operated correctly up to a frequency of 200 MHz. The optical receiver in each PE is formed by a PIN photodiode with a diameter of 40 µm, a transimpedance amplifier and a decision-making postamplifier. The photodiodes are optimized for wavelengths from 650 nm to 850 nm. The receiver works properly for optical powers between 25 µW and 500 µW. For optical tests with fibres we used a defined pitch of 250 µm for the photodiode structures. This allows us to couple the chip to a fibre array and thus to test its optical functionality. More details concerning the implementation of the digital and analogue parts as well as the comparator circuit can be found in [10] and [11].

The photo detectors of the architecture of Figure 5 were optimized for fast optical data links, e.g. between neighboring boards. Our intention was to test these optoelectronic detectors within a mixed-signal circuit that contains a real application. We therefore decided to integrate the parallel smart pixel sensor architecture for binary image pre-processing described above. However, for a real sensor architecture with a pixel resolution and a fill factor as high as possible, it is easier to realize the parallel processor array as a linear SIMD array than as a smart pixel architecture. We therefore designed a further parallel sensor chip together with the Endowed Chair for Neural Circuits and Parallel VLSI-Systems of Dresden University of Technology. Again the analogue part was realized by our collaboration partner and the digital part by us.

The architecture of this chip is organized as follows. The sensor matrix is not interleaved with logic; instead, the sensor array with a size of 128×128 pixels forms a single block. At periodic time steps a whole pixel row is shifted out of the sensor array. The photo currents of the pixels of one row are then converted serially, one after another, by an eleven-bit ADC. The eight MSBs (most significant bits) are temporarily stored in a shift register, which thus holds the digitized values of a pixel row. The register values are read by a linear SIMD array of PEs, which process the digitized pixel data row by row in pipelined mode. The data is binarized by subtracting an externally controllable threshold value. The resulting binary image is processed with pre-processing algorithms similar to those described above. In future designs of this chip we intend to integrate an extended version of the reconfigurable data path shown in Figure 4. The experience we gained from designing and measuring this sensor chip and the chip of Figure 5 is important preliminary work for this purpose.

4. Conclusion

We investigated different architectural approaches for the integration of a parallel SIMD processor array in a one-chip high-speed CMOS camera. The task of the SIMD processor is to execute simple image pre-processing tasks. A parallel architecture is favored in order to meet real-time requirements. The core of the parallel architecture is the combination of parallel analogue optical signal detection and parallel digital signal processing based on morphological operations within a parallel smart pixel array. To support high resolutions and reasonable fill factors, the PEs in the processor array have to be as small as possible. At the same time the PEs must offer sufficient functional flexibility. Based on the results of a logic synthesis, we found that a 2-D array of PEs with reconfigurable data paths is a better choice than a traditional programmable von Neumann architecture. The reconfigurable architecture requires about half the area of a programmable architecture with the same functionality. With a standard two-metal-layer 0.6 µm CMOS process we can achieve a resolution of about 240×240 pixels on 1 cm² of chip area. Using a more advanced process technology and a manually optimized compact layout, e.g. based on complex gates, it should be feasible to integrate a VGA resolution. For a future realization of such a circuit we can profit considerably from the experience we gained in designing other parallel SIMD chip architectures, e.g. for fast optical data links and for a parallel non-smart-pixel sensor architecture.

5. References

[1] W. Wolf, B. Ozer, T. Lv, "Smart Cameras as Embedded Systems", IEEE Computer, pp. 48-53, September 2002.
[2] E. Fossum, "Digital Camera System on a Chip", IEEE Micro, pp. 8-15, May/June 1998.
[3] R.P. Kleihorst et al., "Xetal a low-power high-performance smart camera processor", Proc. ISCAS2001, Sydney, Australia, 2001.
[4] F. Paillet, D. Mercier, T.M. Bernard, and E. Senn, "Second Generation Programmable Artificial Retina", Proceedings IEEE ASIC/SOC Conf., pp. 304-309, Sept. 1999.
[5] J.G. Gealow and C.G. Sodini, "A Pixel-Parallel Image Processor Using Logic Pitch-Matched to Dynamic Memory", IEEE Journal of Solid-State Circuits, Vol. 34, No. 6, pp. 831-839, June 1999.
[6] S. Kleinfelder, S. Lim, X. Liu, A. El Gamal, "A 10 000 Frame/s CMOS Digital Pixel Sensor", IEEE Journal of Solid-State Circuits, Vol. 36, No. 12, pp. 2049-2059, December 2001.
[7] P. Dudek, "A programmable focal-plane analogue processor array", Ph.D. Thesis, UMIST, Manchester, May 2000.
[8] J. Villasenor, W.H. Mangione-Smith, "Configurable Computing", Scientific American, Vol. 276, No. 6, pp. 54-59, June 1997.
[9] D. Schmidt, "Entwicklung einer rekonfigurierbaren Hardware zur parallelen, monochromen digitalen Bildvorverarbeitung" [Development of reconfigurable hardware for parallel, monochrome digital image pre-processing], Diploma thesis, Institute of Computer Science, University Jena, January 2004.
[10] D. Fey, L. Hoppe, A. Loos, M. Förtsch, H. Zimmermann, "Parallel optical interconnects with mixed-signal OEIC and fibre arrays for high-speed communication", Proceedings of SPIE, Vol. 5453, Photonics Europe, Strasbourg, April 2004.
[11] D. Fey, L. Hoppe, A. Loos, "Reconfigurable optoelectronic interconnects for VLSI circuits based on fibre arrays", Recent Research Development in Optics, 3, pp. 205-222, Research Signpost, ISBN: 81-271-0028-5, 2003.
